Hi Dan,
Happy holidays!
Good idea to combine these datasets! However, we have one more dataset, by
Erik Zachte: http://dumps.wikimedia.org/other/pagecounts-ez/

On Thu, Dec 24, 2015 at 2:41 PM, Dan Andreescu <[email protected]>
wrote:

> I should have started this discussion a while ago, but it's easier to
> catch up on work during vacation :)
>
> We currently have 3 available static file dumps of pageview data.  I will
> explain them here and explain my thoughts on simplifying the situation.
> Feel free to turn this thread into a wiki.
>
> * PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>.  We
> have this data going back to 2007.  It uses a very simple pageview
> definition that incorrectly counts things such as banner views as
> pageviews.
> * PAGECOUNTS-ALL-SITES
> <http://dumps.wikimedia.org/other/pagecounts-all-sites/>.  We have this
> data starting in late 2014.  Compared to PAGECOUNTS-RAW, this dataset also
> adds traffic from the mobile versions of our sites.  But it's still using
> the same simple pageview definition.
> * PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>.  We have this
> data starting in May 2015.  It implements the new and much improved
> pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
> that we now use.  This is the same pageview definition used in the pageview
> API.  This dataset also removes spider traffic and any automata traffic
> that we can detect.
>
> All three datasets are in the same format (Domas's archive format).
>
> So, before we can simplify this confusing situation, we need your help and
> input about what to keep and how to keep it.  Here's the approach I would
> take:
>
> Combine pagecounts-raw with pagecounts-all-sites into a new dataset called
> "pagecounts".  Keep producing data to this dataset forever, but remove
> "pagecounts-raw" and "pagecounts-all-sites".  This way, we can compare new
> data with historical data going back as far as we need.  We would explain
> on dumps.wikimedia.org/other that this dataset gains mobile data starting
> in October 2014, to explain the relative local spike that happens there.
> This dataset would remain a pretty bad estimate of actual page views, and
> would remain sensitive to automata and spider spikes.  But in combination
> with the "pageviews" dataset, I think it would be useful.
>
> What do you all think?  Sound off in this thread, and if we have consensus
> I'll start the cleanup.
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
Thank you.

Alex Druk
[email protected]
(775) 237-8550 Google voice