Dan,

One clarification point I'd make is that while the data is lossless for 30M
articles, it is 100% lossy for redirects, old page names, or pages created
after September 2013, correct?

John

On Wed, Feb 21, 2018 at 2:26 PM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> Hi Lars,
>
> You have a couple of options:
>
> 1. download the data in lossless compressed form:
> https://dumps.wikimedia.org/other/pagecounts-ez/  The format is clever and
> doesn't lose granularity, and it should be a lot quicker than
> pagecounts-raw (this is basically what stats.grok.se did with the data as
> well, so downloading this way should be equivalent)
> 2. work on Toolforge, a virtual cloud that's on the same network as the
> data, so getting the data is a lot faster and you can use our compute
> resources (free, of course):
> https://wikitech.wikimedia.org/wiki/Portal:Toolforge
>
> If you decide to go with the second option, the IRC channel where they
> support folks like you is #wikimedia-cloud and you can always find me there
> as milimetric.
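
Whichever option is used, picking a handful of articles out of the hourly
pagecounts-raw files comes down to streaming each gzipped file and keeping
only the matching lines. Below is a minimal sketch in Python, assuming the
pagecounts-raw line layout of four space-separated fields (project code,
page title, view count, bytes transferred). The function names are made up
for illustration, and the pagecounts-ez files use a different, more compact
encoding that this sketch does not handle.

```python
import gzip
from collections import Counter


def count_views(lines, project, titles):
    """Sum view counts for the wanted titles within one project.

    Assumes each line has four space-separated fields:
    project code, page title, view count, bytes transferred.
    """
    wanted = set(titles)
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        proj, title, views, _size = parts
        if proj == project and title in wanted:
            try:
                totals[title] += int(views)
            except ValueError:
                continue  # skip lines whose count is not an integer
    return totals


def count_views_in_file(path, project, titles):
    """Stream one gzipped hourly file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_views(handle, project, titles)
```

A call would look like count_views_in_file("some-hourly-file.gz", "en",
["Copenhagen"]), where the file name is a placeholder for a file downloaded
from the pagecounts-raw directory. Titles may appear percent-encoded in the
files, so non-ASCII titles probably need to be encoded the same way before
matching.
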
>
>
> On Tue, Feb 20, 2018 at 12:51 PM, Lars Hillebrand <
> larshillebr...@icloud.com> wrote:
>
>> Dear Analytics Team,
>>
>> I am an M.Sc. student at Copenhagen Business School. For my Master's thesis
>> I would like to use page views data from certain Wikipedia articles. I
>> found out that in July 2015 a new API was created which delivers this data.
>> However, for my project I have to use data from before 2015.
>> In my further search I found out that the old page views data exists (
>> https://dumps.wikimedia.org/other/pagecounts-raw/) and until March 2017
>> it could be queried by using stats.grok.se. Unfortunately, this site
>> no longer exists, which is why I cannot filter and query the raw data
>> in .gz format on the webpage.
>>
>> Are there any possibilities to get the page views data for certain
>> articles from before July 2015?
>>
>> Thanks a lot and best regards,
>>
>> Lars Hillebrand
>>
>> PS: I am conducting my research in R and for the post 2015 data the
>> package “pageviews” works great.
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>


-- 

*JOHN URBANIK*
Lead Data Engineer

jurba...@predata.com
860.878.1010
379 West Broadway
New York, NY 10012
