Re: [Analytics] Wikipedia throttling

Chico Venancio Wed, 27 Feb 2019 06:06:36 -0800

Using the dumps (https://meta.wikimedia.org/wiki/Data_dumps) is the best way
to go through that many pages daily.
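
As a rough sketch of the "does an article exist yet?" half of the check,
assuming you have already downloaded the all-titles file from the dumps
(enwiki-latest-all-titles-in-ns0.gz, one title per line, underscores in
place of spaces), something like the following could replace the 100k
daily existence queries. The function names are hypothetical, and it only
does exact title matching, so names with middle initials or disambiguation
suffixes would need extra handling:

```python
import gzip


def normalize(name):
    """Convert a display name to dump title form (assumed convention:
    spaces become underscores, first character is uppercased)."""
    name = name.strip().replace(" ", "_")
    return name[:1].upper() + name[1:]


def load_titles(path):
    """Read a gzipped all-titles file (one title per line) into a set."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def missing_from_wikipedia(names, titles):
    """Return the names with no exact-title match in the loaded set."""
    return [n for n in names if normalize(n) not in titles]
```

Loading the titles once into a set keeps each lookup cheap, so the whole
100k list can be checked locally without touching the live servers. Getting
the article content for the matches would still need the page-content dumps
or a much smaller number of API requests.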


Chico Venancio
(+55 98) 9 8800 2743


On Wed, 27 Feb 2019 at 11:01, John Bohannon <[email protected]>
wrote:

>
> Hello!
>
> I'm hoping to get advice on how we should approach the following
> *challenge*...
>
> I am building a public website that will provide information that is
> automatically harvested from online news articles about the work of
> scientists. The goal is to make it easier to create and maintain scientific
> content on Wikipedia.
>
> Here's some news about the project:
> https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
>
> And here is the prototype of the site: https://quicksilver.primer.ai
>
> What I am working on now is a self-updating version of this site.
>
> The goal is to provide daily refreshed information for scientists most
> likely to be missing from Wikipedia.
>
> For now I am focusing on English-language news and English-language
> Wikipedia. Eventually this will expand to other languages.
>
> The ~100 scientists shown on any given day are selected from ~100k
> scientists that the system is tracking for news updates.
>
> So here's the *challenge*:
>
> To choose the 100 scientists most in need of an update on Wikipedia, we
> need to query Wikipedia each day for the 100k scientists to see if they
> have an article yet, and if so to get its content (to check if we have new
> information).
>
> I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
>
> What is the most polite, sanctioned method for programmatic access to
> Wikipedia for a daily job on this scale?
>
> Many thanks for help/advice!
>
> John Bohannon
> http://johnbohannon.org
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

