Dan,

Thanks for the clarification - digging into the files, I see that there are
redirects and more than 30M titles.
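
(If anyone wants to reproduce that check, a rough Python sketch along these
lines should work; the en.z project code and the column layout are my reading
of the merged-file format, so treat both as assumptions rather than
documented facts.)

    import bz2

    # Stream one merged monthly file and count distinct English Wikipedia titles.
    # The file name below is a hypothetical local download; the assumed line
    # layout is roughly: project title total encoded-hourly-counts.
    path = "pagecounts-2014-01-views-ge-5.bz2"

    titles = set()
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            parts = line.split(" ")
            if len(parts) >= 2 and parts[0] == "en.z":  # en.z = English Wikipedia here
                titles.add(parts[1])

    print(f"distinct en.z titles: {len(titles):,}")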

My view had been informed by the documentation at
https://dumps.wikimedia.org/other/pagecounts-ez/:

> Hourly page views per article for around 30 million article titles (Sept
> 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme shrinkage,
> without losing granularity), corrected, reformatted. Daily files and two
> monthly files (see notes below).


Regarding the claim that pagecounts-ez has data going back to when Wikimedia
started tracking pageviews, I'll point out another error in the
documentation that may have led to that view. The documentation claims that
data is available from 2007 onward:

> From 2007 to May 2015: derived from Domas' pagecount/projectcount files


However, if you check out the actual files (
https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll see that
the pagecounts only go back to late 2011.
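
(Quick way to verify: scrape the directory listing for the months that appear
in the file names, e.g. with something like the sketch below. The
pagecounts-YYYY-MM file-name pattern is just what the listing appears to use,
so treat it as an assumption.)

    import re
    import urllib.request

    # Fetch the merged/ directory listing and report the earliest and latest
    # months mentioned in the file names.
    URL = "https://dumps.wikimedia.org/other/pagecounts-ez/merged/"

    with urllib.request.urlopen(URL) as resp:
        listing = resp.read().decode("utf-8", errors="replace")

    months = sorted(set(re.findall(r"pagecounts-(\d{4}-\d{2})", listing)))
    print("earliest month:", months[0] if months else "none found")
    print("latest month:  ", months[-1] if months else "none found")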

I never bothered with pagecounts-ez because I believed that only 30 million
articles were covered in the datasets and because the data isn't available
for the 2008-2011 period. I always jumped straight to pagecounts-raw, since
we need access to both newer and older page titles and since it means
merging fewer formats when building a dataset going back to 2008.

Now that I know the dataset has broader coverage, I would find it
extremely helpful if some jobs could be run to generate pagecounts-ez for
2008-2011.

On Thu, Feb 22, 2018 at 12:17 PM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> John: I think you may have gotten the wrong impression from some
> description, and I'm not sure what you were looking at.  As far as I know,
> pagecounts-ez is the most comprehensive dataset we have with pageviews from
> as early as we started tracking them.  It should have all articles,
> regardless of when they were created, regardless of whether they're
> redirects or not.  If you find evidence to the contrary, either in docs or
> the data itself, please let me know.
>
> Tilman: thanks very much for the docs update, I'm never quite sure what is
> and isn't clear, and I'm afraid we have a mountain of documentation that
> might defeat its own purpose.
>
> On Wed, Feb 21, 2018 at 2:34 PM, John Urbanik <jurba...@predata.com>
> wrote:
>
>> Dan,
>>
>> One clarification point I'd make is that while the data is lossless for
>> 30M articles, it is 100% lossy for redirects, old page names, or pages
>> created after September 2013, correct?
>>
>> John
>>
>> On Wed, Feb 21, 2018 at 2:26 PM, Dan Andreescu <dandree...@wikimedia.org>
>> wrote:
>>
>>> Hi Lars,
>>>
>>> You have a couple of options:
>>>
>>> 1. download the data in lossless compressed form:
>>> https://dumps.wikimedia.org/other/pagecounts-ez/  The format is clever,
>>> doesn't lose granularity, and should be a lot quicker to work with than
>>> pagecounts-raw (this is basically what stats.grok.se did with the data as
>>> well, so downloading this way should be equivalent)
>>> 2. work on Toolforge, a virtual cloud that's on the same network as the
>>> data, so getting the data is a lot faster and you can use our compute
>>> resources (free, of course): https://wikitech.wikimedia.org/wiki/Portal:Toolforge
>>>
>>> If you decide to go with the second option, the IRC channel where they
>>> support folks like you is #wikimedia-cloud and you can always find me there
>>> as milimetric.
>>>
>>>
>>> On Tue, Feb 20, 2018 at 12:51 PM, Lars Hillebrand <
>>> larshillebr...@icloud.com> wrote:
>>>
>>>> Dear Analytics Team,
>>>>
>>>> I am an M.Sc. student at Copenhagen Business School. For my Master's
>>>> thesis I would like to use page view data from certain Wikipedia articles.
>>>> I found out that in July 2015 a new API was created which delivers this
>>>> data. However, for my project I have to use data from before 2015.
>>>> Searching further, I found that the old page view data exists (
>>>> https://dumps.wikimedia.org/other/pagecounts-raw/) and that until March
>>>> 2017 it could be queried using stats.grok.se. Unfortunately, this site no
>>>> longer exists, which is why I cannot filter and query the raw data in .gz
>>>> format on the webpage.
>>>>
>>>> Is there any way to get the page view data for certain articles from
>>>> before July 2017?
>>>>
>>>> Thanks a lot and best regards,
>>>>
>>>> Lars Hillebrand
>>>>
>>>> PS: I am conducting my research in R, and for the post-2015 data the
>>>> package “pageviews” works great.
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>>
>> --
>>
>> *JOHN URBANIK*
>> Lead Data Engineer
>>
>> jurba...@predata.com
>> 860.878.1010
>> 379 West Broadway
>> New York, NY 10012
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 

*JOHN URBANIK*
Lead Data Engineer

jurba...@predata.com
860.878.1010
379 West Broadway
New York, NY 10012
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
