Hi Dan,

Happy holidays!
Good idea to combine these datasets! However, we have one more dataset, by Erik Zachte:
http://dumps.wikimedia.org/other/pagecounts-ez/

On Thu, Dec 24, 2015 at 2:41 PM, Dan Andreescu <[email protected]> wrote:

> I should have started this discussion a while ago, but it's easier to
> catch up on work during vacation :)
>
> We currently have 3 available static file dumps of pageview data. I will
> explain them here and explain my thoughts on simplifying the situation.
> Feel free to turn this thread into a wiki.
>
> * PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
> have this data going back to 2007. This is using a very simple pageview
> definition which incorrectly counts things like banner views as pageviews
> (for example).
> * PAGECOUNTS-ALL-SITES
> <http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
> data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also
> adds traffic from the mobile versions of our sites. But it's still using
> the same simple pageview definition.
> * PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
> data starting in May 2015. It implements the new and much improved
> pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
> that we now use. This is the same pageview definition used in the pageview
> API. This dataset also removes spider traffic and any automata traffic
> that we can detect.
>
> All three datasets are in the same format (Domasz's archive format).
>
> So, before we can simplify this confusing situation, we need your help and
> input about what to keep and how to keep it. Here's the approach I would
> take:
>
> Combine pagecounts-raw with pagecounts-all-sites into a new dataset called
> "pagecounts". Keep producing data to this dataset forever, but remove
> "pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
> data with historical data going back as far as we need. We would explain
> on dumps.wikimedia.org/other that this dataset gains mobile data starting
> in October 2014, to explain the relative local spike that happens there.
> This dataset would remain a pretty bad estimate of actual page views, and
> would remain sensitive to automata and spider spikes. But in combination
> with the "pageviews" dataset, I think it would be useful.
>
> What do you all think? Sound off in this thread, and if we have consensus
> I'll start the cleanup.
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

--
Thank you.

Alex Druk
[email protected]
(775) 237-8550 Google voice
