John,

Assuming you look up the non-existent pages by title, you can query up to 500 titles in a single API request: [0] That means you would need only ~200 requests to cover 100K titles. Do those requests serially (not in parallel) and I doubt you will hit any rate limit, as long as you stay conscious of server limits. You were probably hitting multiple 404s in parallel, which is not ideal.
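To make the batching concrete, here is a minimal sketch, assuming the 100K names are held in a Python list. The helper names are mine; the API parameters (`action=query`, titles joined with `|`, and `maxlag` for politeness) come from the MediaWiki API docs, and note that the 500-title batch size requires the `apihighlimits` right (ordinary clients are limited to 50 titles per request):

```python
# Sketch: check ~100K titles against enwiki in batches of up to 500,
# issued serially. batches() and build_params() are illustrative names.

def batches(titles, size=500):
    """Split a list of titles into chunks of at most `size`."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def build_params(batch):
    """Build the query-string parameters for one batched request."""
    return {
        "action": "query",
        "format": "json",
        "titles": "|".join(batch),   # up to 500 titles per request
        "maxlag": 5,                 # back off politely when servers are lagged
    }

titles = [f"Scientist {n}" for n in range(100_000)]   # stand-in data
requests_needed = [build_params(b) for b in batches(titles)]
print(len(requests_needed))  # 200 requests for 100K titles at 500 per batch
```

Each batch would then be fetched serially against `https://en.wikipedia.org/w/api.php`; titles that have no article come back marked as missing in the query result, which is exactly the signal needed here.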
As an alternative, creating a tool [1] inside our infrastructure that generates daily dumps of all page titles is also a possibility that people might find interesting. After all, it would be a slow but unique SQL query per day. Here is a query that would work [2] (note you would get the titles encoded in MediaWiki format, i.e. with underscores instead of spaces).

[0] <https://www.mediawiki.org/wiki/API:Query#Specifying_pages>
[1] <https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas>
[2] SELECT page_title FROM enwiki_p.page WHERE page_namespace = 0

On Wed, Feb 27, 2019 at 3:01 PM John Bohannon <[email protected]> wrote:

> Hello!
>
> I'm hoping to get advice on how we should approach the following
> challenge...
>
> I am building a public website that will provide information that is
> automatically harvested from online news articles about the work of
> scientists. The goal is to make it easier to create and maintain
> scientific content on Wikipedia.
>
> Here's some news about the project:
> https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
>
> And here is the prototype of the site: https://quicksilver.primer.ai
>
> What I am working on now is a self-updating version of this site.
>
> The goal is to provide daily refreshed information for scientists most
> likely to be missing from Wikipedia.
>
> For now I am focusing on English-language news and English-language
> Wikipedia. Eventually this will expand to other languages.
>
> The ~100 scientists shown on any given day are selected from ~100k
> scientists that the system is tracking for news updates.
>
> So here's the challenge:
>
> To choose the 100 scientists most in need of an update on Wikipedia, we
> need to query Wikipedia each day for the 100k scientists to see if they
> have an article yet, and if so to get its content (to check if we have
> new information).
>
> I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
> What is the most polite, sanctioned method for programmatic access to
> Wikipedia for a daily job on this scale?
>
> Many thanks for help/advice!
>
> John Bohannon
> http://johnbohannon.org
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Jaime Crespo <http://wikimedia.org>
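P.S. To elaborate on the MediaWiki-format caveat attached to [2]: `page_title` values in the replicas are stored with underscores instead of spaces (and, on enwiki, with the first letter capitalized), so matching them against display names takes a small conversion. A minimal sketch, with helper names of my own choosing:

```python
# Sketch: converting between the dump's stored form and display form.
# Reflects how page_title is stored (underscores for spaces, first
# letter capitalized); to_display/to_db are hypothetical helper names.

def to_display(db_title: str) -> str:
    """enwiki_p.page.page_title form -> human-readable form."""
    return db_title.replace("_", " ")

def to_db(display_title: str) -> str:
    """Human-readable form -> form to match against the dump."""
    t = display_title.replace(" ", "_")
    return t[:1].upper() + t[1:]  # stored titles are first-letter-capitalized

print(to_display("Marie_Curie"))  # Marie Curie
print(to_db("marie Curie"))       # Marie_Curie
```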
