Peter, the data you mention here is quite large, and storage is cheap but not free. For now, we don't have capacity to serve that kind of timespan from the API, but we will work to improve the dumps version so it's more comprehensive.
On Thu, Feb 22, 2018 at 4:12 PM, Peter Meissner <retep.meiss...@gmail.com> wrote:
> Dear List-eners,
>
> I write in to argue the case for a Wikipedia effort to make something
> like stats.grok.se (page views per day per article from 2007 onwards)
> available again.
>
> I am the author of the first R package that provided easy access to
> pageview counts by accessing the stats.grok.se service and translating
> it into neat little R data frames.
>
> Since stats.grok.se is gone, somebody writes in once a month - mostly from
> academia - asking about the status of page view data for the time before
> late 2015 - counts, per article, per day. To underline this further: the R
> pageviews package written by one of your former colleagues has over 7000
> downloads within 2 years while my package has 14000 within 4 years (which
> are conservative numbers because they stem from one particular CRAN mirror
> only).
>
> I made some efforts to reconstruct the service that stats.grok.se was
> providing, but it's not a trivial endeavour as far as I can see (BIG
> data, demanding some computing time and storage resources and bandwidth,
> and some thinking about how to re-arrange and aggregate the data so it can
> be queried and served efficiently - not to mention that the data is raw,
> meaning it needs some proper cleaning up before using, also hosting will
> need some resources, ...) - and so my efforts have gone nowhere.
>
> Would it not be nice if Wikipedia could jump in and support research by
> going the whole mile and making those page counts available?
>
> In regard to the prioritizing - I am sure you have a long backlog - I
> would argue that this is something that really is a multiplier thing. It
> enables a lot of people to start researching. Daily page counts are not
> that fancy, but without them people are simply blocked. They cannot start
> because they can't even get a basic idea of what the general article
> popularity was for a given day.
>
> Best, Peter
>
> PS.: I would be willing to put in some time to help you folks in any way I
> can.
>
> 2018-02-22 21:56 GMT+01:00 Dan Andreescu <dandree...@wikimedia.org>:
>
>> My view had been informed by the documentation at
>>> https://dumps.wikimedia.org/other/pagecounts-ez/:
>>>
>>> Hourly page views per article for around 30 million article titles (Sept
>>>> 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme shrinkage,
>>>> without losing granularity), corrected, reformatted. Daily files and two
>>>> monthly files (see notes below).
>>>
>>> Regarding the claim that pagecounts-ez has data back to when wikimedia
>>> started tracking pageviews, I'll point out another error in the
>>> documentation that may have led to that view. The documentation claims that
>>> data is available from 2007 onward:
>>>
>>> From 2007 to May 2015: derived from Domas' pagecount/projectcount files
>>>
>>> However, if you check out the actual files (
>>> https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll see
>>> that the pagecounts only go back to late 2011.
>>>
>>
>> Ah, yes, but the projectcount files go back to 2007-12, that's where that
>> confusion comes from, we should clarify or generate the old data. I'm not
>> sure whether this is easy, but I think it's fairly straightforward and I've
>> opened a task for it: https://phabricator.wikimedia.org/T188041 (we have
>> a lot of work in our backlog, though, so we probably won't be able to get
>> to this for a bit).
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>