Using the dumps (https://meta.wikimedia.org/wiki/Data_dumps) is the best way to go through that many pages daily.
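A minimal sketch of what the dump-based check could look like, assuming you have downloaded the title list that ships with the English Wikipedia dumps (enwiki-latest-all-titles-in-ns0.gz, one title per line with underscores instead of spaces). The function names and the local file path are hypothetical, and the exact-title match is a simplification: it will not resolve disambiguated titles such as "Name (chemist)".

```python
import gzip
from typing import Iterable, List, Set


def normalize_title(name: str) -> str:
    """Turn a display name into dump title form (underscores, first letter uppercased)."""
    title = "_".join(name.strip().split())
    return title[:1].upper() + title[1:]


def load_titles(path: str) -> Set[str]:
    """Read a gzipped one-title-per-line file into a set for fast lookups."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return {line.rstrip("\n") for line in fh if line.strip()}


def missing_from_wikipedia(names: Iterable[str], titles: Set[str]) -> List[str]:
    """Return the names that have no exactly matching article title."""
    return [name for name in names if normalize_title(name) not in titles]


if __name__ == "__main__":
    # Hypothetical local path; the file has to be downloaded beforehand.
    titles = load_titles("enwiki-latest-all-titles-in-ns0.gz")
    print(missing_from_wikipedia(["Marie Curie", "Ada Lovelace"], titles))
```

This way the 100k existence checks run locally against one downloaded file instead of 100k requests to the servers. Getting article content would need the larger pages-articles dump.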
Chico Venancio
(+55 98) 9 8800 2743

On Wed, Feb 27, 2019 at 11:01, John Bohannon <[email protected]> wrote:

> Hello!
>
> I'm hoping to get advice on how we should approach the following
> *challenge*...
>
> I am building a public website that will provide information that is
> automatically harvested from online news articles about the work of
> scientists. The goal is to make it easier to create and maintain scientific
> content on Wikipedia.
>
> Here's some news about the project:
> https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
>
> And here is the prototype of the site: https://quicksilver.primer.ai
>
> What I am working on now is a self-updating version of this site.
>
> The goal is to provide daily refreshed information for scientists most
> likely to be missing from Wikipedia.
>
> For now I am focusing on English-language news and English-language
> Wikipedia. Eventually this will expand to other languages.
>
> The ~100 scientists shown on any given day are selected from ~100k
> scientists that the system is tracking for news updates.
>
> So here's the *challenge*:
>
> To choose the 100 scientists most in need of an update on Wikipedia, we
> need to query Wikipedia each day for the 100k scientists to see if they
> have an article yet, and if so to get its content (to check if we have new
> information).
>
> I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
>
> What is the most polite, sanctioned method for programmatic access to
> Wikipedia for a daily job on this scale?
>
> Many thanks for help/advice!
>
> John Bohannon
> http://johnbohannon.org
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
