Thanks Dan, that's very helpful. I've asked two follow-up questions inline below.
• Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
• Software Engineer, Data Engine
• Google Inc.
• [email protected] - 650.336.5691
• 1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043

On Sat, Mar 31, 2018 at 12:34 AM, Dan Andreescu <[email protected]> wrote:

> Thanks to Tilman for pointing out that this data is still being worked
> on. So, yes, there are lots of subtleties in how we count articles,
> redirects, content vs. non-content, etc. I don't have the answer to all of
> the discrepancies that Tilman found, but if you need a very accurate
> answer, the only way is to get an account on labs and start digging into
> how exactly you want to count the articles.

What's the best way to sign up for a labs account? (Does it require
certain qualifications?) And could you point us to the code, or to the
entry point of the code repository?

> As our datasets and APIs get more mature, we're hoping to give as much
> flexibility as everyone needs, but not so much as to drive people crazy.
> Until then, we're slowly improving our docs.
>
> And yes, don't read some of this stuff alone at night, the buddy system
> works well for data analysis, lol
>
> On Fri, Mar 30, 2018 at 6:43 AM, Zainan Zhou (a.k.a Victor) <[email protected]> wrote:
>
>> Thank you very much Dan, this turns out to be very helpful. My teammates
>> have started looking into it.
>>
>> On Fri, Mar 30, 2018 at 5:12 AM, Dan Andreescu <[email protected]> wrote:
>>
>>> Forwarding this question to the public Analytics list, where it's good
>>> to have these kinds of discussions. If you're interested in this data and
>>> how it changes over time, do subscribe and watch for updates, notices of
>>> outages, etc.
>>>
>>> Ok, so on to your question. You'd like the *total # of articles for
>>> each wiki*. I think the simplest way right now is to query the AQS
>>> (Analytics Query Service) API, documented here:
>>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>>>
>>> To get the # of articles for a wiki, let's say en.wikipedia.org, you
>>> can get the timeseries of new articles per month since the beginning of
>>> time:
>>>
>>> https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
>>>
>>> And to get a list of all wikis, to plug into that URL instead of
>>> "en.wikipedia.org", the most up-to-date information is here:
>>> https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or via
>>> the MediaWiki API:
>>> https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600
>>> Sometimes new sites won't have data in the AQS API for a month or two
>>> until we add them and start crunching their stats.
>>>
>>> The way I figured this out is to look at how our UI uses the API:
>>> https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages
>>> So if you're interested in something else, you can browse around there
>>> and take a look at the XHR requests in the browser console. Have fun!
>>>
>>> On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <[email protected]> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> How are you! This is Victor. It's been a while since we met at the
>>>> 2018 Wikimedia Dev Summit. I hope you are doing great.
>>>>
>>>> As I mentioned to you, my team works on extracting knowledge from
>>>> Wikipedia. It is currently undergoing a project that expands language
>>>> coverage.
>>>> My teammate Yuan Gao (cc'ed here) is the tech lead of this
>>>> project. She plans to *monitor the list of all currently available
>>>> Wikipedia sites and the number of articles for each language*, so
>>>> that we can compare against our extraction system's output to sanity-check
>>>> whether there is a massive breakage of the extraction logic, or whether we
>>>> need to add/remove languages in the event that a new Wikipedia site is
>>>> introduced to or removed from the Wikipedia family.
>>>>
>>>> I think your team, Analytics at Wikimedia, probably knows best
>>>> where we can find this data. Here are 4 places we already know of,
>>>> but they don't seem to have the data:
>>>>
>>>> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>>>>   information we need, but the list is manually edited, not automatic
>>>> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list,
>>>>   but the information seems pretty out of date (last updated almost
>>>>   a month ago)
>>>> - Stats v2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I
>>>>   can't find the full list or the number of articles
>>>> - API https://wikimedia.org/api/rest_v1/, suggested by elukey on the
>>>>   #wikimedia-analytics channel; it doesn't seem to have article-count
>>>>   information
>>>>
>>>> Do you know a good place to find this information? Thank you!
>>>>
>>>> Victor
>>>> * • *[email protected] <[email protected]> - 650.336.5691 >>>> * • * 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043 >>>> >>>> ---------- Forwarded message ---------- >>>> From: Yuan Gao <[email protected]> >>>> Date: Wed, Mar 28, 2018 at 4:15 PM >>>> Subject: Monitor the number of Wikipedia sites and the number of >>>> articles in each site >>>> To: Zainan Victor Zhou <[email protected]> >>>> Cc: Wenjie Song <[email protected]>, WikiData <[email protected]> >>>> >>>> >>>> Hi Victor, >>>> as we discussed in the meeting, I'd like to monitor: >>>> 1) the number of Wikipedia sites >>>> 2) the number of articles in each site >>>> >>>> Can you help us to contact with WMF to get a realtime or at least daily >>>> update of these numbers? What we can find now is >>>> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of >>>> Wikipedia sites is manually updated, and possibly out-of-date. >>>> >>>> >>>> The monitor can help us catch such bugs. >>>> >>>> -- >>>> Yuan Gao >>>> >>>> >>> >> >
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
