Zainan: Labs is our cloud environment for volunteers. You can direct questions about it to the cloud e-mail list.
https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction

Thanks,

Nuria

On Mon, Apr 2, 2018 at 7:44 PM, Zainan Zhou (a.k.a Victor) <[email protected]> wrote:

> Thanks Dan, that's very helpful. I asked two follow-up questions inline
> below.
>
> Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
> Software Engineer, Data Engine
> Google Inc.
> [email protected] - 650.336.5691
> 1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
>
> On Sat, Mar 31, 2018 at 12:34 AM, Dan Andreescu <[email protected]>
> wrote:
>
>> Thanks to Tilman for pointing out that this data is still being worked
>> on. So, yes, there are lots of subtleties in how we count articles,
>> redirects, content vs. non-content, etc. I don't have the answer to all of
>> the discrepancies that Tilman found, but if you need a very accurate
>> answer, the only way is to get an account on labs and start digging into
>> how exactly you want to count the articles.
>
> What's the best way to sign up for a Labs account? (Does it require
> certain qualifications?)
> And could you point us to the code, or to the entry point of the code
> repository?
>
>> As our datasets and APIs get more mature, we're hoping to give as much
>> flexibility as everyone needs, but not so much as to drive people crazy.
>> Until then, we're slowly improving our docs.
>>
>> And yes, don't read some of this stuff alone at night, the buddy system
>> works well for data analysis, lol
>>
>> On Fri, Mar 30, 2018 at 6:43 AM, Zainan Zhou (a.k.a Victor)
>> <[email protected]> wrote:
>>
>>> Thank you very much Dan, this turns out to be very helpful. My
>>> teammates have started looking into it.
>>>
>>> On Fri, Mar 30, 2018 at 5:12 AM, Dan Andreescu <[email protected]>
>>> wrote:
>>>
>>>> Forwarding this question to the public Analytics list, where it's good
>>>> to have these kinds of discussions. If you're interested in this data
>>>> and how it changes over time, do subscribe and watch for updates,
>>>> notices of outages, etc.
>>>>
>>>> Ok, so on to your question. You'd like the *total # of articles for
>>>> each wiki*. I think the simplest way right now is to query the AQS
>>>> (Analytics Query Service) API, documented here:
>>>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>>>>
>>>> To get the # of articles for a wiki, let's say en.wikipedia.org, you
>>>> can get the timeseries of new articles per month since the beginning
>>>> of time:
>>>>
>>>> https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
>>>>
>>>> And to get a list of all wikis, to plug into that URL instead of
>>>> "en.wikipedia.org", the most up-to-date information is here:
>>>> https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or
>>>> via the MediaWiki API:
>>>> https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600
>>>> Sometimes new sites won't have data in the AQS API for a month or two
>>>> until we add them and start crunching their stats.
>>>>
>>>> The way I figured this out is to look at how our UI uses the API:
>>>> https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages
>>>> So if you were interested in something else, you can browse around
>>>> there and take a look at the XHR requests in the browser console.
>>>> Have fun!
>>>>
>>>> On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor)
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> How are you? This is Victor. It's been a while since we met at the
>>>>> 2018 Wikimedia Dev Summit. I hope you are doing great.
>>>>>
>>>>> As I mentioned to you, my team works on extracting knowledge from
>>>>> Wikipedia. It is currently working on a project that expands language
>>>>> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
>>>>> project. She plans to *monitor the list of all the currently
>>>>> available Wikipedia sites and the number of articles for each
>>>>> language*, so that we can compare with our extraction system's output
>>>>> to sanity-check whether there is a massive breakage of the extraction
>>>>> logic, or whether we need to add/remove languages in the event that a
>>>>> new Wikipedia site is added to or removed from the Wikipedia family.
>>>>>
>>>>> I think your team at Analytics at Wikimedia probably knows best where
>>>>> we can find this data. Here are 4 places we already know of, but they
>>>>> don't seem to have the data.
>>>>>
>>>>> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>>>>>   information we need, but the list is manually edited, not automatic
>>>>> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
>>>>>   the information seems pretty out of date (last updated almost a
>>>>>   month ago)
>>>>> - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I
>>>>>   can't find the full list or the number of articles
>>>>> - API https://wikimedia.org/api/rest_v1/, suggested by elukey on the
>>>>>   #wikimedia-analytics channel; it doesn't seem to have # of articles
>>>>>   information
>>>>>
>>>>> Do you know what is a good place to find this information? Thank you!
>>>>>
>>>>> Victor
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Yuan Gao <[email protected]>
>>>>> Date: Wed, Mar 28, 2018 at 4:15 PM
>>>>> Subject: Monitor the number of Wikipedia sites and the number of
>>>>> articles in each site
>>>>> To: Zainan Victor Zhou <[email protected]>
>>>>> Cc: Wenjie Song <[email protected]>, WikiData <[email protected]>
>>>>>
>>>>> Hi Victor,
>>>>> as we discussed in the meeting, I'd like to monitor:
>>>>> 1) the number of Wikipedia sites
>>>>> 2) the number of articles in each site
>>>>>
>>>>> Can you help us contact WMF to get a realtime or at least daily
>>>>> update of these numbers? What we can find now is
>>>>> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
>>>>> Wikipedia sites is manually updated, and possibly out-of-date.
>>>>>
>>>>> The monitor can help us catch such bugs.
>>>>>
>>>>> --
>>>>> Yuan Gao
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
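
For anyone following the thread, the two requests Dan describes (the monthly new-pages timeseries from AQS and the site list from the sitematrix API) can be combined roughly as below. This is only a sketch under assumptions: the function names are made up for illustration, and the response shapes (a `new_pages` field per monthly result, numeric language-group keys with a `site` list, `code == "wiki"` for Wikipedias) are assumed from memory and should be checked against the live responses. Summing new pages per month is also only a rough figure, given the counting subtleties mentioned above.

```python
import json
import urllib.request

AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"


def new_pages_url(project, start="2001010100", end="2018032900"):
    """Build the monthly new-pages URL quoted in the thread for one wiki."""
    return (f"{AQS_BASE}/{project}/all-editor-types/all-page-types/"
            f"monthly/{start}/{end}")


def total_new_pages(payload):
    """Sum monthly new-page counts.

    Assumed response shape (not confirmed by the thread):
    {"items": [{"results": [{"timestamp": ..., "new_pages": N}, ...]}]}
    """
    return sum(row.get("new_pages", 0)
               for item in payload.get("items", [])
               for row in item.get("results", []))


def open_wikipedias(sitematrix):
    """Return hostnames of non-closed Wikipedias from a sitematrix response.

    Assumed shape: language groups sit under numeric string keys, each with
    a "site" list; Wikipedia entries are assumed to carry code == "wiki".
    """
    hosts = []
    for key, group in sitematrix.get("sitematrix", {}).items():
        if not key.isdigit():
            continue  # skip non-language entries such as "count" and "specials"
        for site in group.get("site", []):
            if site.get("code") == "wiki" and not site.get("closed"):
                hosts.append(site["url"].split("://", 1)[1])
    return hosts


def fetch_json(url):
    """Fetch a URL and decode the JSON body (the User-Agent is a placeholder)."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "article-count-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A daily monitor could call `fetch_json` on the sitematrix URL, pass the result to `open_wikipedias`, and then call `total_new_pages(fetch_json(new_pages_url(host)))` for each host, keeping in mind that new sites may have no AQS data for a month or two.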
