Thanks, Scott. I failed to find that task and incorrectly assumed we had declined it. My fault; we'll see about loading that data then.
And yes, Peter, per-article dumps are already there, but they're split across pagecounts-raw from 2008-2011 and pagecounts-ez after that. The conversation before you posted was that we would try to get pagecounts-ez to include all available history on a per-article level, since pagecounts-ez is the most convenient and fastest way to get to this data.

On Thu, Feb 22, 2018 at 6:31 PM, Scott Hale <[email protected]> wrote:

> FYI that there is a phabricator task to load legacy pagecounts by article
> to AQS:
> https://phabricator.wikimedia.org/T173720
>
> That task arose from a discussion on this mailing list mid-last year:
> https://www.mail-archive.com/[email protected]/msg04349.html
> https://www.mail-archive.com/[email protected]/msg04350.html
>
> Cheers,
> Scott
>
> On Thu, Feb 22, 2018 at 11:25 PM, Nuria Ruiz <[email protected]> wrote:
>
>> Peter:
>>
>> Do submit a phabricator task with your request; it'll be easier to
>> follow up on it than it is via e-mail. Our backlog:
>> https://phabricator.wikimedia.org/tag/analytics/
>>
>> I assume you know that per-article views are available since 2015; a way
>> to see those: https://tools.wmflabs.org/pageviews/
>>
>> Per-project views are available since early on, in either downloadable
>> files or programmatic form:
>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts
>>
>> Thanks,
>>
>> Nuria
>>
>> On Thu, Feb 22, 2018 at 1:44 PM, Peter Meissner <[email protected]> wrote:
>>
>>> Like dumps on article-day level? That would already be super awesome,
>>> much better than the current state.
>>>
>>> Best, Peter
>>>
>>> On 22.02.2018 22:23, "Dan Andreescu" <[email protected]> wrote:
>>>
>>>> Peter, the data you mention here is quite large, and storage is cheap
>>>> but not free. For now, we don't have capacity to serve that kind of
>>>> timespan from the API, but we will work to improve the dumps version so
>>>> it's more comprehensive.
>>>>
>>>> On Thu, Feb 22, 2018 at 4:12 PM, Peter Meissner <[email protected]> wrote:
>>>>
>>>>> Dear List-eners,
>>>>>
>>>>> I write in to argue the case for a Wikipedia effort to make something
>>>>> like stats.grok.se (page views per day per article from 2007 onwards)
>>>>> available again.
>>>>>
>>>>> I am the author of the first R package that provided easy access to
>>>>> pageview counts by accessing the stats.grok.se service and
>>>>> translating it into neat little R data frames.
>>>>>
>>>>> Since stats.grok.se is gone, somebody writes in once a month - mostly
>>>>> from academia - asking about the status of page view data for the time
>>>>> before late 2015 - counts, per article, per day. To underline this
>>>>> further: the R pageviews package written by one of your former colleagues
>>>>> has over 7000 downloads within 2 years while my package has 14000 within
>>>>> 4 years (which are conservative numbers because they stem from one
>>>>> particular CRAN mirror only).
>>>>>
>>>>> I made some efforts to reconstruct the service that stats.grok.se was
>>>>> providing, but it's not a trivial endeavour as far as I can see (BIG
>>>>> data, demanding some computing time and storage resources and bandwidth,
>>>>> and some thinking about how to re-arrange and aggregate the data so it can
>>>>> be queried and served efficiently - not to mention that the data is raw,
>>>>> meaning it needs some proper cleaning up before use; hosting will also
>>>>> need some resources, ...) - and so my efforts have gone nowhere.
>>>>>
>>>>> Would it not be nice if Wikipedia could jump in and support research
>>>>> by going the whole mile and making those page counts available?
>>>>>
>>>>> In regard to the prioritizing - I am sure you have a long backlog - I
>>>>> would argue that this is something that really is a multiplier thing. It
>>>>> enables a lot of people to start researching.
>>>>> Daily page counts are not that fancy, but without them people are
>>>>> simply blocked. They cannot start because they can't even get a basic
>>>>> idea about what the general article popularity was for a given day.
>>>>>
>>>>> Best, Peter
>>>>>
>>>>> PS: I would be willing to put in some time to help you folks in any
>>>>> way I can.
>>>>>
>>>>> 2018-02-22 21:56 GMT+01:00 Dan Andreescu <[email protected]>:
>>>>>
>>>>>>> My view had been informed by the documentation at
>>>>>>> https://dumps.wikimedia.org/other/pagecounts-ez/:
>>>>>>>
>>>>>>>> Hourly page views per article for around 30 million article titles
>>>>>>>> (Sept 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme
>>>>>>>> shrinkage, without losing granularity), corrected, reformatted. Daily
>>>>>>>> files and two monthly files (see notes below).
>>>>>>>
>>>>>>> Regarding the claim that pagecounts-ez has data back to when
>>>>>>> wikimedia started tracking pageviews, I'll point out another error in
>>>>>>> the documentation that may have led to that view. The documentation
>>>>>>> claims that data is available from 2007 onward:
>>>>>>>
>>>>>>>> From 2007 to May 2015: derived from Domas' pagecount/projectcount
>>>>>>>> files
>>>>>>>
>>>>>>> However, if you check out the actual files
>>>>>>> (https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll
>>>>>>> see that the pagecounts only go back to late 2011.
>>>>>>
>>>>>> Ah, yes, but the projectcount files go back to 2007-12; that's where
>>>>>> that confusion comes from. We should clarify or generate the old data.
>>>>>> I'm not sure whether this is easy, but I think it's fairly
>>>>>> straightforward and I've opened a task for it:
>>>>>> https://phabricator.wikimedia.org/T188041
>>>>>> (we have a lot of work in our backlog, though, so we probably won't be
>>>>>> able to get to this for a bit).
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> Dr Scott A. Hale
> http://scott.hale.us
> [email protected]
