Dear List-eners,

I am writing to argue the case for a Wikipedia effort to make something like
stats.grok.se (page views per day per article from 2007 onwards) available
again.


I am the author of the first R package that provided easy access to
pageview counts by querying the stats.grok.se service and translating the
results into neat little R data frames.

Since stats.grok.se went away, somebody has written in about once a month -
mostly from academia - asking about the status of page view data for the
time before late 2015: counts, per article, per day. To underline this
further: the R pageviews package written by one of your former colleagues
has over 7000 downloads within 2 years, while my package has 14000 within
4 years (these are conservative numbers because they stem from one
particular CRAN mirror only).

I made some efforts to reconstruct the service that stats.grok.se was
providing, but it is not a trivial endeavour as far as I can see. It is BIG
data, demanding computing time, storage and bandwidth, and some thinking
about how to re-arrange and aggregate the data so it can be queried and
served efficiently. On top of that, the data is raw, meaning it needs
proper cleaning up before use, and hosting will need resources as well.
So my efforts have gone nowhere.


Would it not be nice if Wikipedia could jump in and support research by
going the whole mile and making those page counts available?

Regarding prioritization - I am sure you have a long backlog - I would
argue that this really is a multiplier. It enables a lot of people to start
researching. Daily page counts are not that fancy, but without them people
are simply blocked. They cannot start because they can't even get a basic
idea of the general popularity of an article on a given day.


Best,
Peter



PS.: I would be willing to put in some time to help you folks in any way I
can.


2018-02-22 21:56 GMT+01:00 Dan Andreescu <dandree...@wikimedia.org>:

> My view had been informed by the documentation at
>> https://dumps.wikimedia.org/other/pagecounts-ez/:
>>
>> Hourly page views per article for around 30 million article titles (Sept
>>> 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme shrinkage,
>>> without losing granularity), corrected, reformatted. Daily files and two
>>> monthly files (see notes below).
>>
>>
>> Regarding the claim that pagecounts-ez has data back to when wikimedia
>> started tracking pageviews, I'll point out another error in the
>> documentation that may have led to that view. The documentation claims that
>> data is available from 2007 onward:
>>
>>  From 2007 to May 2015: derived from Domas' pagecount/projectcount files
>>
>>
>> However, if you check out the actual files
>> (https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll see that
>> the pagecounts only go back to late 2011.
>>
>
> Ah, yes, but the projectcount files go back to 2007-12, that's where that
> confusion comes from, we should clarify or generate the old data.  I'm not
> sure whether this is easy, but I think it's fairly straightforward and I've
> opened a task for it: https://phabricator.wikimedia.org/T188041 (we have
> a lot of work in our backlog, though, so we probably won't be able to get
> to this for a bit).
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
