Another data source is
https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table (transcluded in
https://meta.wikimedia.org/wiki/List_of_Wikipedias ), which is updated twice
daily by a bot that directly retrieves the numbers reported on each wiki's
[[Special:Statistics]] page, and can be considered reliable. (I.e. it uses
essentially the same primary source as
http://wikistats.wmflabs.org/display.php?t=wp , the tool Ahmed mentioned.)
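
If you prefer to pull those primary numbers yourself, the same per-wiki
figures are also exposed through each wiki's own API via
action=query&meta=siteinfo&siprop=statistics. A minimal sketch in Python (the
domain list and the User-Agent string are just placeholders):

    import requests

    # Minimal sketch: read the same per-wiki numbers that the bot pulls from
    # [[Special:Statistics]], via each wiki's siteinfo API. The domain list
    # and User-Agent here are just placeholders.
    for domain in ["en.wikipedia.org", "de.wikipedia.org", "ja.wikipedia.org"]:
        r = requests.get(
            f"https://{domain}/w/api.php",
            params={"action": "query", "meta": "siteinfo",
                    "siprop": "statistics", "format": "json"},
            headers={"User-Agent": "article-count-check/0.1 (example)"},
        )
        stats = r.json()["query"]["statistics"]
        print(domain, stats["articles"])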

A few more comments inline below.

On Thu, Mar 29, 2018 at 2:12 PM, Dan Andreescu <[email protected]>
wrote:

> Forwarding this question to the public Analytics list, where it's good to
> have these kinds of discussions.  If you're interested in this data and how
> it changes over time, do subscribe and watch for updates, notices of
> outages, etc.
>
> Ok, so on to your question.  You'd like the *total # of articles for each
> wiki*.  I think the simplest way right now is to query the AQS (Analytics
> Query Service) API, documented here:
> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>
> To get the # of articles for a wiki, let's say en.wikipedia.org, you can
> get the timeseries of new articles per month since the beginning of time:
>
> *https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900*
>

Unless I'm mistaken, summing up these monthly numbers would yield 3.5
million articles - but English Wikipedia already has over 5 million per
https://en.wikipedia.org/wiki/Special:Statistics . Do you get a different
result?
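
(For anyone who wants to reproduce the comparison, here is a quick sketch of
how one might sum those monthly values. The items[0]["results"] layout and
the "new_pages" field name are my reading of the current API output, so
verify them against an actual response.)

    import requests

    # Quick sketch: sum the monthly new-article counts from the AQS endpoint
    # quoted above. The items[0]["results"] layout and the "new_pages" field
    # name are assumptions - verify before relying on them.
    url = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
           "en.wikipedia.org/all-editor-types/all-page-types/monthly/"
           "2001010100/2018032900")
    resp = requests.get(url, headers={"User-Agent": "article-count-check/0.1 (example)"})
    resp.raise_for_status()
    months = resp.json()["items"][0]["results"]
    print(sum(month["new_pages"] for month in months))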

In general, it's worth being aware that there are various subtleties
involved in defining article counts precisely, as detailed at
https://meta.wikimedia.org/wiki/Article_counts_revisited . (But not too
aware; that's not good for your mental health. Seriously, that page is a
data analyst's version of a horror novel. Don't read it alone at night.)


> And to get a list of all wikis, to plug into that URL instead of "
> en.wikipedia.org", the most up-to-date information is here:
> https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via
> the MediaWiki API:
> https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600.
> Sometimes new
> sites won't have data in the AQS API for a month or two until we add them
> and start crunching their stats.
>
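
A quick sketch of consuming that sitematrix response to build the list of
Wikipedia domains, in case it saves someone a step. The filtering on the
"wiki" site code and the "closed" flag reflects my understanding of the
response format, so double-check it against the actual output:

    import requests

    # Sketch: build the list of Wikipedia domains from the SiteMatrix API,
    # to plug into the AQS URL above. Skipping entries flagged as "closed"
    # and filtering on the "wiki" site code are assumptions worth verifying.
    r = requests.get(
        "https://meta.wikimedia.org/w/api.php",
        params={"action": "sitematrix", "format": "json"},
        headers={"User-Agent": "wiki-list-check/0.1 (example)"},
    )
    matrix = r.json()["sitematrix"]
    wikipedias = []
    for key, group in matrix.items():
        if key in ("count", "specials"):
            continue  # "count" is a total; "specials" holds non-language wikis
        for site in group.get("site", []):
            if site.get("code") == "wiki" and "closed" not in site:
                wikipedias.append(site["url"].replace("https://", ""))
    print(len(wikipedias), wikipedias[:5])
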
> The way I figured this out is to look at how our UI uses the API:
> https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.
> So if you're interested in something else, you can browse around there
> and take a look at the XHR requests in the browser console.  Have fun!
>
> On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <
> [email protected]> wrote:
>
>> Hi Dan,
>>
>> How are you? This is Victor. It's been a while since we met at the 2018
>> Wikimedia Dev Summit. I hope you are doing great.
>>
>> As I mentioned to you, my team works on extracting knowledge from
>> Wikipedia. We are currently working on a project that expands language
>> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
>> project. She plans to *monitor the list of all currently available
>> Wikipedia sites and the number of articles for each language*, so that
>> we can compare it with our extraction system's output to sanity-check
>> whether there is a massive breakage of the extraction logic, or whether
>> we need to add/remove languages when a new Wikipedia site is introduced
>> to, or removed from, the Wikipedia family.
>>
>> I think your team at Analytics at Wikimedia probably knows best where
>> we can find this data. Here are 4 places we already know of, but none of
>> them seems to have the data we need.
>>
>>
>>    - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>>    information we need, but the list is manually edited, not
>>    automatically updated
>>    - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
>>    the information seems pretty out of date (last updated almost a month
>>    ago)
>>    - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects; I can't
>>    find the full list or the number of articles there
>>    - API https://wikimedia.org/api/rest_v1/, suggested by elukey on the
>>    #wikimedia-analytics channel; it doesn't seem to have article-count
>>    information
>>
>> Do you know what is a good place to find this information? Thank you!
>>
>> Victor
>>
>>
>>
>> •  Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
>> •  Software Engineer, Data Engine
>> •  Google Inc.
>> •  [email protected] <[email protected]> - 650.336.5691
>> •  1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
>>
>> ---------- Forwarded message ----------
>> From: Yuan Gao <[email protected]>
>> Date: Wed, Mar 28, 2018 at 4:15 PM
>> Subject: Monitor the number of Wikipedia sites and the number of articles
>> in each site
>> To: Zainan Victor Zhou <[email protected]>
>> Cc: Wenjie Song <[email protected]>, WikiData <[email protected]>
>>
>>
>> Hi Victor,
>> as we discussed in the meeting, I'd like to monitor:
>> 1) the number of Wikipedia sites
>> 2) the number of articles in each site
>>
>> Can you help us contact WMF to get a real-time, or at least daily,
>> update of these numbers? What we can find now is
>> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
>> Wikipedia sites there is manually updated, and possibly out of date.
>>
>>
>> The monitor can help us catch such bugs.
>>
>> --
>> Yuan Gao
>>
>>
>


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
