Hi Tilman,
our team at Google, which works on extracting knowledge from Wikipedia,
has just compared our crawled data with
https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table. For the following
sites, we see quite significant diffs:
    Wikipedia site                # listed in Table    # from Google crawled data
    http://ady.wikipedia.org/       409                   549
    http://bjn.wikipedia.org/      1844                  1952
    http://bo.wikipedia.org/       5818                 11120

We follow the same definition of an article
<https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F> when counting
the Google crawled data. Is there any way to debug why the diffs are so
large? Taking bo.wikipedia.org as an example, we tried to crawl all URLs
listed in
https://bo.wikipedia.org/w/index.php?title=Special:AllPages&hideredirects=1,
but it seems the list still contains redirect pages, so the total number of
URLs is 16498, not 5818.
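One way to cross-check against the official numbers, rather than crawling Special:AllPages, is to read each wiki's [[Special:Statistics]] counts through the standard MediaWiki action API (`action=query&meta=siteinfo&siprop=statistics`). The sketch below assumes that API shape; the parsing helper can be exercised offline on a sample payload:

```python
# Sketch: read a wiki's official article count via the MediaWiki action API.
# The siteinfo "statistics" block mirrors what [[Special:Statistics]] shows.
import json
import urllib.request

def build_stats_url(domain: str) -> str:
    """Build the siteinfo-statistics API URL for a wiki domain."""
    return (f"https://{domain}/w/api.php"
            "?action=query&meta=siteinfo&siprop=statistics&format=json")

def parse_article_count(payload: dict) -> int:
    """Extract the 'articles' count from a siteinfo-statistics response."""
    return payload["query"]["statistics"]["articles"]

def fetch_article_count(domain: str) -> int:
    """Fetch and parse the article count for one wiki (network call)."""
    with urllib.request.urlopen(build_stats_url(domain)) as resp:
        return parse_article_count(json.load(resp))
```

For example, `fetch_article_count("bo.wikipedia.org")` should agree with the Table, since both read the same primary source; the `pages` field in the same response counts every page (redirects, talk pages, etc.), which may explain the larger crawl totals.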

On Fri, Mar 30, 2018 at 10:51 AM, Tilman Bayer <[email protected]> wrote:

> Another data source is
> https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table (transcluded in
> https://meta.wikimedia.org/wiki/List_of_Wikipedias ), which is updated
> twice daily by a bot that directly retrieves the numbers as reported in
> each wiki's [[Special:Statistics]] page, and can be considered reliable.
> (I.e. it is using basically the same primary source as
> http://wikistats.wmflabs.org/display.php?t=wp , the tool Ahmed mentioned.)
>
> Two more comments inline below.
>
> On Thu, Mar 29, 2018 at 2:12 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> Forwarding this question to the public Analytics list, where it's good to
>> have these kinds of discussions.  If you're interested in this data and how
>> it changes over time, do subscribe and watch for updates, notices of
>> outages, etc.
>>
>> Ok, so on to your question.  You'd like the *total # of articles for
>> each wiki*.  I think the simplest way right now is to query the AQS
>> (Analytics Query Service) API, documented here:
>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>>
>> To get the # of articles for a wiki, let's say en.wikipedia.org, you can
>> get the timeseries of new articles per month since the beginning of time:
>>
>> https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
>>
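Totaling that monthly series could be sketched as follows (an illustration only, assuming the AQS response shape `items[0].results[*].new_pages` for this endpoint):

```python
# Sketch: sum the monthly new-pages series from the AQS endpoint above.
import json
import urllib.request

AQS_URL = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
           "en.wikipedia.org/all-editor-types/all-page-types/monthly/"
           "2001010100/2018032900")

def total_new_pages(payload: dict) -> int:
    """Sum the per-month counts in an AQS edited-pages/new response."""
    results = payload["items"][0]["results"]
    return sum(month["new_pages"] for month in results)

def fetch_total() -> int:
    """Fetch the full timeseries and return the running total (network call)."""
    with urllib.request.urlopen(AQS_URL) as resp:
        return total_new_pages(json.load(resp))
```

Note that a sum of new pages over time need not equal the current article count, since pages also get deleted or turned into redirects.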
>
> Unless I'm mistaken, summing up these monthly numbers would yield 3.5
> million articles - but English Wikipedia already has over 5 million per
> https://en.wikipedia.org/wiki/Special:Statistics . Do you get a different
> result?
>
> In general, it's worth being aware that there are various subtleties
> involved in defining article counts precisely, as detailed at
> https://meta.wikimedia.org/wiki/Article_counts_revisited . (But not too
> much aware, that's not good for your mental health. Seriously, that page is
> a data analyst's version of a horror novel. Don't read it alone at night.)
>
>
>> And to get a list of all wikis, to plug into that URL instead of "
>> en.wikipedia.org", the most up-to-date information is here:
>> https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via
>> the mediawiki API:
>> https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600.
>> Sometimes new sites won't have data in the AQS
>> API for a month or two until we add them and start crunching their stats.
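Consuming that sitematrix response could look like the sketch below (assuming the documented shape where per-language entries sit under numeric keys alongside `count` and `specials`, and each language's `site` list marks its Wikipedia with `code: "wiki"`):

```python
# Sketch: list Wikipedia domains from a sitematrix API response.
def wikipedia_domains(payload: dict) -> list:
    """Collect the domain of every open Wikipedia in a sitematrix response."""
    matrix = payload["sitematrix"]
    domains = []
    for key, entry in matrix.items():
        if key in ("count", "specials"):  # skip the total and special wikis
            continue
        for site in entry.get("site", []):
            # code "wiki" marks the Wikipedia project for each language
            if site.get("code") == "wiki" and not site.get("closed"):
                # strip the scheme, e.g. "https://en.wikipedia.org" -> domain
                domains.append(site["url"].split("//", 1)[1])
    return domains
```

The resulting domains can then be substituted into the AQS URL in place of "en.wikipedia.org".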
>>
>> The way I figured this out was to look at how our UI uses the API:
>> https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.
>> So if you were interested in something else, you can browse around there
>> and take a look at the XHR requests in the browser console.  Have fun!
>>
>> On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <
>> [email protected]> wrote:
>>
>>> Hi Dan,
>>>
>>> How are you? This is Victor; it's been a while since we met at the 2018
>>> Wikimedia Dev Summit. I hope you are doing great.
>>>
>>> As I mentioned to you, my team works on extracting knowledge from
>>> Wikipedia. It is currently undergoing a project that expands language
>>> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
>>> project. She plans to *monitor the list of all currently available
>>> Wikipedia sites and the number of articles for each language*, so
>>> that we can compare with our extraction system's output to sanity-check
>>> whether there is a massive breakage of the extraction logic, or whether
>>> we need to add/remove languages in the event that a new Wikipedia site is
>>> introduced to/removed from the Wikipedia family.
>>>
>>> I think your team at Analytics at Wikimedia probably knows best
>>> where we can find this data. Here are 4 places we already know of, but
>>> they don't seem to have the data.
>>>
>>>
>>>    - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>>>    information we need, but the list is manually edited, not
>>>    automatically updated
>>>    - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
>>>    the information seems pretty out of date (last updated almost a month
>>>    ago)
>>>    - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, I can't
>>>    find the full list or the number of articles
>>>    - API https://wikimedia.org/api/rest_v1/ suggested by elukey on the
>>>    #wikimedia-analytics channel, it doesn't seem to have the # of
>>>    articles information
>>>
>>> Do you know what is a good place to find this information? Thank you!
>>>
>>> Victor
>>>
>>>
>>>
>>> •  Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
>>> •  Software Engineer, Data Engine
>>> •  Google Inc.
>>> •  [email protected] <[email protected]> - 650.336.5691
>>> •  1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
>>>
>>> ---------- Forwarded message ----------
>>> From: Yuan Gao <[email protected]>
>>> Date: Wed, Mar 28, 2018 at 4:15 PM
>>> Subject: Monitor the number of Wikipedia sites and the number of
>>> articles in each site
>>> To: Zainan Victor Zhou <[email protected]>
>>> Cc: Wenjie Song <[email protected]>, WikiData <[email protected]>
>>>
>>>
>>> Hi Victor,
>>> as we discussed in the meeting, I'd like to monitor:
>>> 1) the number of Wikipedia sites
>>> 2) the number of articles in each site
>>>
>>> Can you help us contact WMF to get a realtime, or at least daily,
>>> update of these numbers? What we can find now is
>>> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
>>> Wikipedia sites there is manually updated, and possibly out of date.
>>>
>>>
>>> Such monitoring can help us catch these kinds of bugs.
>>>
>>> --
>>> Yuan Gao
>>>
>>>
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>



-- 
Yuan Gao
