Thanks to Tilman for pointing out that this data is still being worked on.
So, yes, there are lots of subtleties in how we count articles, redirects,
content vs. non-content, etc.  I don't have answers to all of the
discrepancies that Tilman found, but if you need a very accurate count,
the only way is to get an account on Labs and start digging into how
exactly you want to count the articles.  As our datasets and APIs get more
mature, we're hoping to give as much flexibility as everyone needs, but not
so much as to drive people crazy.  Until then, we're slowly improving our
docs.

And yes, don't read some of this stuff alone at night; the buddy system
works well for data analysis, lol

On Fri, Mar 30, 2018 at 6:43 AM, Zainan Zhou (a.k.a Victor) <[email protected]>
wrote:

> Thank you very much, Dan; this turns out to be very helpful. My teammates
> have started looking into it.
>
>
> •  Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
> •  Software Engineer, Data Engine
> •  Google Inc.
> •  [email protected] - 650.336.5691
> •  1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
>
> On Fri, Mar 30, 2018 at 5:12 AM, Dan Andreescu <[email protected]>
> wrote:
>
>> Forwarding this question to the public Analytics list, where it's good to
>> have these kinds of discussions.  If you're interested in this data and how
>> it changes over time, do subscribe and watch for updates, notices of
>> outages, etc.
>>
>> Ok, so on to your question.  You'd like the total # of articles for
>> each wiki.  I think the simplest way right now is to query the AQS
>> (Analytics Query Service) API, documented here:
>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>>
>> To get the # of articles for a wiki, let's say en.wikipedia.org, you can
>> get the timeseries of new articles per month since the beginning of time:
>>
>> https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
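>>
>> If what you want is one total per wiki, summing that monthly timeseries is
>> the quick way to get it.  Here's a rough Python sketch of what I mean
>> (using the requests library; the "new_pages" field name may differ, so
>> double-check it against a live response before relying on it):
>>
>> import requests
>>
>> AQS = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
>>        "{project}/all-editor-types/all-page-types/monthly/{start}/{end}")
>>
>> def total_new_pages(project, start="2001010100", end="2018032900"):
>>     # Fetch the monthly timeseries of newly created pages for one wiki.
>>     url = AQS.format(project=project, start=start, end=end)
>>     resp = requests.get(url,
>>                         headers={"User-Agent": "article-count-sanity-check"})
>>     resp.raise_for_status()
>>     results = resp.json()["items"][0]["results"]
>>     # Sum the per-month counts into a single running total.
>>     return sum(month["new_pages"] for month in results)
>>
>> print(total_new_pages("en.wikipedia.org"))
>>
>> Keep the counting subtleties in mind: "all-page-types" includes non-content
>> pages, and I believe you can put "content" in that segment of the URL if
>> that's closer to what you mean by "articles".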
>>
>> And to get a list of all wikis, to plug into that URL instead of "
>> en.wikipedia.org", the most up-to-date information is here:
>> https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or via
>> the MediaWiki API:
>> https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600.
>> Sometimes new sites won't have data in the
>> AQS API for a month or two until we add them and start crunching their
>> stats.
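>>
>> And here's a rough sketch of pulling the domains out of that sitematrix
>> response (assuming it keeps its current shape: numeric keys per language
>> group, plus "count" and "specials"), so you can plug each one into the AQS
>> URL above:
>>
>> import requests
>>
>> SITEMATRIX = ("https://meta.wikimedia.org/w/api.php?action=sitematrix"
>>               "&formatversion=2&format=json&maxage=3600&smaxage=3600")
>>
>> def wikipedia_domains():
>>     resp = requests.get(SITEMATRIX,
>>                         headers={"User-Agent": "article-count-sanity-check"})
>>     resp.raise_for_status()
>>     matrix = resp.json()["sitematrix"]
>>     domains = []
>>     for key, group in matrix.items():
>>         if key in ("count", "specials"):  # skip the total and special wikis
>>             continue
>>         for site in group.get("site", []):
>>             # code == "wiki" means a Wikipedia (vs. wiktionary, wikibooks, ...)
>>             if site.get("code") == "wiki" and not site.get("closed"):
>>                 domains.append(site["url"].replace("https://", ""))
>>     return domains
>>
>> print(len(wikipedia_domains()))
>>
>> Looping the earlier sketch over this list gives you a per-wiki total; just
>> be tolerant of brand-new wikis that aren't in AQS yet.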
>>
>> The way I figured this out was to look at how our UI uses the API:
>> https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.
>> So if you're interested in something else, you can browse around there
>> and take a look at the XHR requests in the browser console.  Have fun!
>>
>> On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <
>> [email protected]> wrote:
>>
>>> Hi Dan,
>>>
>>> How are you? This is Victor; it's been a while since we met at the 2018
>>> Wikimedia Dev Summit. I hope you are doing great.
>>>
>>> As I mentioned to you, my team works on extracting knowledge from
>>> Wikipedia, and we are currently running a project that expands language
>>> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
>>> project. She plans to monitor the list of all currently available
>>> Wikipedia sites and the number of articles for each language, so
>>> that we can compare it with our extraction system's output to sanity-check
>>> whether there is a massive breakage of the extraction logic, or whether we
>>> need to add/remove languages when a new Wikipedia site is introduced to,
>>> or removed from, the Wikipedia family.
>>>
>>> I think your Analytics team at Wikimedia probably knows best
>>> where we can find this data. Here are 4 places we already know of, but none
>>> of them seems to have the data we need.
>>>
>>>
>>>    - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>>>    information we need, but the list is manually edited, not automatically
>>>    updated
>>>    - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
>>>    the information seems pretty out of date (last updated almost a month ago)
>>>    - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I
>>>    can't find the full list or the number of articles
>>>    - The API https://wikimedia.org/api/rest_v1/, suggested by elukey on the
>>>    #wikimedia-analytics channel, doesn't seem to have article-count
>>>    information
>>>
>>> Do you know of a good place to find this information? Thank you!
>>>
>>> Victor
>>>
>>>
>>>
>>> •  Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
>>> •  Software Engineer, Data Engine
>>> •  Google Inc.
>>> •  [email protected] - 650.336.5691
>>> •  1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
>>>
>>> ---------- Forwarded message ----------
>>> From: Yuan Gao <[email protected]>
>>> Date: Wed, Mar 28, 2018 at 4:15 PM
>>> Subject: Monitor the number of Wikipedia sites and the number of
>>> articles in each site
>>> To: Zainan Victor Zhou <[email protected]>
>>> Cc: Wenjie Song <[email protected]>, WikiData <[email protected]>
>>>
>>>
>>> Hi Victor,
>>> As we discussed in the meeting, I'd like to monitor:
>>> 1) the number of Wikipedia sites
>>> 2) the number of articles in each site
>>>
>>> Can you help us contact WMF to get a real-time, or at least daily,
>>> update of these numbers? What we can find now is
>>> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
>>> Wikipedia sites there is manually updated, and possibly out of date.
>>>
>>>
>>> This kind of monitoring can help us catch such bugs.
>>>
>>> --
>>> Yuan Gao
>>>
>>>
>>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
