Dear List-eners,

I am writing to make the case for a Wikipedia effort to make something like stats.grok.se (page views per day per article from 2007 onwards) available again.

I am the author of the first R package that provided easy access to page view counts, by querying the stats.grok.se service and translating its output into neat little R data frames. Since stats.grok.se went away, somebody writes in about once a month - mostly from academia - asking about the status of page view data for the time before late 2015: counts, per article, per day. To underline this further: the R pageviews package written by one of your former colleagues has over 7000 downloads within 2 years, while my package has 14000 within 4 years (both are conservative numbers because they stem from one particular CRAN mirror only).

I made some efforts to reconstruct the service that stats.grok.se was providing, but it is not a trivial endeavour as far as I can see: the data is big, demanding computing time, storage and bandwidth; it takes some thinking about how to re-arrange and aggregate the data so it can be queried and served efficiently; the data is raw, meaning it needs proper cleaning up before use; and hosting will need resources as well. So my efforts have gone nowhere.

Would it not be nice if Wikipedia could jump in and support research by going the whole mile and making those page counts available?

Regarding prioritization - I am sure you have a long backlog - I would argue that this really is a multiplier. It enables a lot of people to start researching. Daily page counts are not that fancy, but without them people are simply blocked. They cannot start because they cannot even get a basic idea of how popular an article was on a given day.

Best
Peter

PS: I would be willing to put in some time to help you folks in any way I can.

2018-02-22 21:56 GMT+01:00 Dan Andreescu <[email protected]>:

>> My view had been informed by the documentation at
>> https://dumps.wikimedia.org/other/pagecounts-ez/:
>>
>>> Hourly page views per article for around 30 million article titles (Sept
>>> 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme shrinkage,
>>> without losing granularity), corrected, reformatted. Daily files and two
>>> monthly files (see notes below).
>>
>> Regarding the claim that pagecounts-ez has data back to when wikimedia
>> started tracking pageviews, I'll point out another error in the
>> documentation that may have led to that view. The documentation claims that
>> data is available from 2007 onward:
>>
>>> From 2007 to May 2015: derived from Domas' pagecount/projectcount files
>>
>> However, if you check out the actual files
>> (https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll see that
>> the pagecounts only go back to late 2011.
>
> Ah, yes, but the projectcount files go back to 2007-12, that's where that
> confusion comes from, we should clarify or generate the old data. I'm not
> sure whether this is easy, but I think it's fairly straightforward and I've
> opened a task for it: https://phabricator.wikimedia.org/T188041 (we have
> a lot of work in our backlog, though, so we probably won't be able to get
> to this for a bit).
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
