Hi Tilman, our team, i.e., the team at Google working on extracting knowledge from Wikipedia, has just compared our crawled data with https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table. For the following sites we see quite significant diffs:

Wikipedia site             # listed in Table   # from Google crawled data
http://ady.wikipedia.org/  409                 549
http://bjn.wikipedia.org/  1844                1952
http://bo.wikipedia.org/   5818                11120

We followed the same definition of an article <https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F> when counting the Google crawled data. Is there any way to debug why there are such huge diffs? Taking bo.wikipedia.org as an example, we tried to crawl all URLs listed in https://bo.wikipedia.org/w/index.php?title=Special:AllPages&hideredirects=1, but the list still seems to contain redirect pages: the total number of URLs is 16498, not 5818.

On Fri, Mar 30, 2018 at 10:51 AM, Tilman Bayer <[email protected]> wrote:
> Another data source is
> https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table (transcluded in
> https://meta.wikimedia.org/wiki/List_of_Wikipedias ), which is updated
> twice daily by a bot that directly retrieves the numbers as reported in
> each wiki's [[Special:Statistics]] page, and can be considered reliable.
> (I.e. it is using basically the same primary source as
> http://wikistats.wmflabs.org/display.php?t=wp , the tool Ahmed mentioned.)
>
> Two more comments inline below.
>
> On Thu, Mar 29, 2018 at 2:12 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> Forwarding this question to the public Analytics list, where it's good to
>> have these kinds of discussions. If you're interested in this data and how
>> it changes over time, do subscribe and watch for updates, notices of
>> outages, etc.
>>
>> Ok, so on to your question. You'd like the *total # of articles for
>> each wiki*.
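The discrepancy Victor describes can also be checked against each wiki's own [[Special:Statistics]] numbers, which are exposed through the MediaWiki API's siteinfo module. A minimal sketch in Python, stdlib only; the helper names are illustrative, not from anything in this thread:

```python
# Sketch: read a wiki's official article count (the number shown on
# [[Special:Statistics]]) via the MediaWiki API's meta=siteinfo module.
# Helper names are illustrative, not from any of the emails above.
import json
import urllib.request

def siteinfo_url(host):
    # action=query & meta=siteinfo & siprop=statistics returns site-wide
    # counters, including the "articles" count used by Special:Statistics.
    return (f"https://{host}/w/api.php"
            "?action=query&meta=siteinfo&siprop=statistics&format=json")

def parse_article_count(payload):
    return payload["query"]["statistics"]["articles"]

def article_count(host):
    with urllib.request.urlopen(siteinfo_url(host)) as resp:
        return parse_article_count(json.load(resp))

# e.g. article_count("bo.wikipedia.org") should track the Table's figure,
# not the raw page-URL total that still includes redirects.
```

For the crawl itself, the API's list=allpages module with apfilterredir=nonredirects is a way to enumerate pages while excluding redirects server-side, rather than relying on the Special:AllPages HTML listing.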
I think the simplest way right now is to query the AQS
>> (Analytics Query Service) API, documented here:
>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>>
>> To get the # of articles for a wiki, let's say en.wikipedia.org, you can
>> get the timeseries of new articles per month since the beginning of time:
>>
>> https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
>>
>
> Unless I'm mistaken, summing up these monthly numbers would yield 3.5
> million articles - but English Wikipedia already has over 5 million per
> https://en.wikipedia.org/wiki/Special:Statistics . Do you get a different
> result?
>
> In general, it's worth being aware that there are various subtleties
> involved in defining article counts precisely, as detailed at
> https://meta.wikimedia.org/wiki/Article_counts_revisited . (But not too
> much aware, that's not good for your mental health. Seriously, that page is
> a data analyst's version of a horror novel. Don't read it alone at night.)
>
>
>> And to get a list of all wikis, to plug into that URL instead of
>> "en.wikipedia.org", the most up-to-date information is here:
>> https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or via
>> the mediawiki API:
>> https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600
>> Sometimes new sites won't have data in the AQS
>> API for a month or two until we add them and start crunching their stats.
>>
>> The way I figured this out is to look at how our UI uses the API:
>> https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages
>> So if you were interested in something else, you can browse around there
>> and take a look at the XHR requests in the browser console. Have fun!
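Tilman's summing check can be scripted against the AQS endpoint Dan links to. A sketch in Python, stdlib only; the response layout assumed here (items[0]["results"] with a "new_pages" field per month) follows the Wikistats 2 docs and should be treated as an assumption:

```python
# Sketch: sum monthly new-article counts from the AQS "edited-pages/new"
# endpoint and compare against Special:Statistics, as Tilman suggests.
# The response field names below are assumptions based on the Wikistats 2
# documentation.
import json
import urllib.request

AQS = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
       "{project}/all-editor-types/all-page-types/monthly/{start}/{end}")

def sum_new_pages(payload):
    # One entry per month; add up the per-month new-page counts.
    return sum(r["new_pages"] for r in payload["items"][0]["results"])

def total_new_articles(project, start="2001010100", end="2018032900"):
    url = AQS.format(project=project, start=start, end=end)
    with urllib.request.urlopen(url) as resp:
        return sum_new_pages(json.load(resp))
```

As Tilman observes, this sum undercounts relative to Special:Statistics (roughly 3.5 million vs. over 5 million for en.wikipedia.org), so it is better used as a trend signal than as an exact article count.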
>>
>> On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <
>> [email protected]> wrote:
>>
>>> Hi Dan,
>>>
>>> How are you! This is Victor; it's been a while since we met at the 2018
>>> Wikimedia Dev Summit. I hope you are doing great.
>>>
>>> As I mentioned to you, my team works on extracting knowledge from
>>> Wikipedia. Currently it is undergoing a project that expands language
>>> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
>>> project. She plans to *monitor the list of all currently available
>>> Wikipedia sites and the number of articles for each language*, so
>>> that we can compare with our extraction system's output to sanity-check
>>> whether there is a massive breakage of the extraction logic, or whether
>>> we need to add/remove languages in the event that a new Wikipedia site
>>> is introduced to/removed from the Wikipedia family.
>>>
>>> I think your team at Analytics at Wikimedia probably knows best
>>> where we can find this data. Here are 4 places we already know of, but
>>> they don't seem to have the data:
>>>
>>> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>>> information we need, but the list is manually edited, not automatic
>>> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
>>> the information seems pretty out of date (last updated almost a month ago)
>>> - Stats v2 UI: https://stats.wikimedia.org/v2/#/all-projects - I can't
>>> find the full list nor the number of articles
>>> - API https://wikimedia.org/api/rest_v1/ suggested by elukey on the
>>> #wikimedia-analytics channel; it doesn't seem to have # of articles
>>> information
>>>
>>> Do you know what is a good place to find this information? Thank you!
>>>
>>> Victor
>>>
>>> * • *Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
>>> * • *Software Engineer, Data Engine
>>> * • *Google Inc.
>>> * • *[email protected] <[email protected]> - 650.336.5691 >>> * • * 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043 >>> >>> ---------- Forwarded message ---------- >>> From: Yuan Gao <[email protected]> >>> Date: Wed, Mar 28, 2018 at 4:15 PM >>> Subject: Monitor the number of Wikipedia sites and the number of >>> articles in each site >>> To: Zainan Victor Zhou <[email protected]> >>> Cc: Wenjie Song <[email protected]>, WikiData <[email protected]> >>> >>> >>> Hi Victor, >>> as we discussed in the meeting, I'd like to monitor: >>> 1) the number of Wikipedia sites >>> 2) the number of articles in each site >>> >>> Can you help us to contact with WMF to get a realtime or at least daily >>> update of these numbers? What we can find now is >>> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of >>> Wikipedia sites is manually updated, and possibly out-of-date. >>> >>> >>> The monitor can help us catch such bugs. >>> >>> -- >>> Yuan Gao >>> >>> >> >> _______________________________________________ >> Analytics mailing list >> [email protected] >> https://lists.wikimedia.org/mailman/listinfo/analytics >> >> > > > -- > Tilman Bayer > Senior Analyst > Wikimedia Foundation > IRC (Freenode): HaeB > -- Yuan Gao
