John,

Assuming you find non-existing pages by title, you can query up to 500
titles in a single request: [0] Which means you would need only 200
requests to cover 100K titles. Make those requests serially (not in
parallel) and I doubt you will hit any rate limit, while staying
conscious of server limits. You were probably hitting multiple
404s in parallel, which is not ideal.
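A minimal sketch of that serial-batch approach, using only the standard library (the User-Agent string and function names are illustrative; note that 500 titles per request requires apihighlimits rights, otherwise the API caps batches at 50):

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"

def batches(seq, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def find_missing(titles, batch_size=500):
    """Return the subset of `titles` with no article, querying serially.

    batch_size=500 assumes apihighlimits rights; use 50 otherwise.
    """
    missing = []
    for chunk in batches(titles, batch_size):
        params = urlencode({"action": "query", "format": "json",
                            "titles": "|".join(chunk)})
        # A descriptive User-Agent is part of being polite (contact is illustrative).
        req = Request(API + "?" + params,
                      headers={"User-Agent": "science-tracker/0.1 ([email protected])"})
        with urlopen(req) as resp:
            data = json.load(resp)
        for page in data["query"]["pages"].values():
            if "missing" in page:
                missing.append(page["title"])
        time.sleep(1)  # one request at a time, with a small pause
    return missing
```

At 500 titles per batch, 100K titles is exactly 200 serial requests.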

As an alternative, creating a tool [1] inside our infrastructure that
generates daily dumps of all page titles is also a possibility that
people might find interesting. After all, it would only take a single,
slow SQL query per day. Here is a query that would work (note you
would get the titles encoded in MediaWiki format [2]).
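On that encoding: the database stores page_title as a binary column with underscores in place of spaces, so raw values need normalizing before comparing against display names. A minimal sketch (function name is illustrative):

```python
def unwiki_title(db_title):
    """Convert a raw page_title value (UTF-8 bytes, underscores for
    spaces) into its display form, e.g. b'Albert_Einstein' -> 'Albert Einstein'.
    """
    if isinstance(db_title, bytes):
        db_title = db_title.decode("utf-8")
    return db_title.replace("_", " ")
```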

[0] <https://www.mediawiki.org/wiki/API:Query#Specifying_pages>
[1] <https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas>
[2] SELECT page_title FROM enwiki_p.page WHERE page_namespace = 0

On Wed, Feb 27, 2019 at 3:01 PM John Bohannon <[email protected]> wrote:
>
>
> Hello!
>
> I'm hoping to get advice on how we should approach the following challenge...
>
> I am building a public website that will provide information that is 
> automatically harvested from online news articles about the work of 
> scientists. The goal is to make it easier to create and maintain scientific 
> content on Wikipedia.
>
> Here's some news about the project: 
> https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
>
> And here is the prototype of the site:  https://quicksilver.primer.ai
>
> What I am working on now is a self-updating version of this site.
>
> The goal is to provide daily refreshed information for scientists most likely 
> to be missing from Wikipedia.
>
> For now I am focusing on English-language news and English-language 
> Wikipedia. Eventually this will expand to other languages.
>
> The ~100 scientists shown on any given day are selected from ~100k 
> scientists that the system is tracking for news updates.
>
> So here's the challenge:
>
> To choose the 100 scientists most in need of an update on Wikipedia, we need 
> to query Wikipedia each day for the 100k scientists to see if they have an 
> article yet, and if so to get its content (to check if we have new 
> information).
>
> I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
>
> What is the most polite, sanctioned method for programmatic access to 
> Wikipedia for a daily job on this scale?
>
> Many thanks for help/advice!
>
> John Bohannon
> http://johnbohannon.org
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics



-- 
Jaime Crespo
<http://wikimedia.org>

