Re: [Analytics] [Technical][Request for Comment] A new format for the pageview dumps

Kevin Leduc Mon, 16 Mar 2015 13:26:33 -0700

I'm curious as to why you are dropping the byte-count.  I'm not opposed to it,
just wondering whether that data is not valuable.




On Fri, Mar 13, 2015 at 12:06 PM, Oliver Keyes <[email protected]> wrote:

> So, we've got a new pageviews definition; it's nicely integrated and
> spitting out TRUE/FALSE values on each row with the best of em. But
> what does that mean for third-party researchers?
>
> Well...not much, at the moment, because the data isn't being released
> anywhere. But one resource we do have that third-parties use a heck
> of a lot is the per-page pageviews dumps on dumps.wikimedia.org.
>
> Due to historical size constraints and decision-making (and by
> historical I mean: last decade) these have a number of weirdnesses in
> formatting terms; project identification is done using a notation
> style not really used anywhere else, mobile/zero/desktop appear on
> different lines, and the files are space-separated. I'd like to put
> some volunteer time into spitting out dumps in an easier-to-work-with
> format, using the new definition, to run in /parallel/ with the
> existing logs.
>
> *The new format*
> At the moment we have the format:
>
> project_notation - encoded_title - pageviews - bytes
>
> This puts zero and mobile requests to pageX in a different place to
> desktop requests, requires some reconstruction of project_notation,
> and contains (for some use cases) extraneous information - that being
> the byte-count. The files are also headerless, unquoted and
> space-separated, which saves space but is sometimes...I think the term
> is "eeeeh-inducing".
>
> What I'd like to use as a new format is:
>
> full_project_url - encoded_title - desktop_pageviews -
> mobile_and_zero_pageviews
>
> This file would:
>
> 1. Include a header row;
> 2. Be formatted as a tab-separated, rather than space-separated, file;
> 3. Exclude bytecounts;
> 4. Include desktop and mobile pageview counts on the same line;
> 5. Use the full project URL ("en.wikivoyage.org") instead of the
> pagecounts-specific notation ("en.v")
>
> So, as a made-up example, instead of:
>
> de.m.v Florence 32 9024
> de.v Florence 920 7570
>
> we'd end up with:
>
> de.wikivoyage.org Florence 920 32
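
The conversion in the made-up example above could be sketched roughly as follows. This is an illustration only, not the script mentioned in this thread. The suffix table, the treatment of "m" and "zero" as mobile markers, the fallback to wikipedia when no project suffix is present, and the function names are all assumptions made for the sketch.

```python
from collections import defaultdict

# Illustrative suffix table based only on the example in this thread;
# a real converter would need the complete legacy project list.
PROJECT_SUFFIXES = {"v": "wikivoyage"}

# Assumption: these components mark mobile or zero traffic.
MOBILE_MARKERS = {"m", "zero"}

HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def split_project(notation):
    """Return (full_project_url, is_mobile) for a notation such as 'de.m.v'."""
    parts = notation.split(".")
    language = parts[0]
    is_mobile = any(p in MOBILE_MARKERS for p in parts[1:])
    suffixes = [p for p in parts[1:] if p not in MOBILE_MARKERS]
    # Assumption: a notation with no project suffix refers to wikipedia.
    family = PROJECT_SUFFIXES[suffixes[0]] if suffixes else "wikipedia"
    return f"{language}.{family}.org", is_mobile


def convert(legacy_lines):
    """Merge legacy space-separated rows into tab-separated rows with a header."""
    counts = defaultdict(lambda: [0, 0])  # [desktop, mobile_and_zero]
    for line in legacy_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed rows
        notation, title, views, _byte_count = fields  # byte-count is dropped
        project, is_mobile = split_project(notation)
        counts[(project, title)][1 if is_mobile else 0] += int(views)
    rows = ["\t".join(HEADER)]
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        rows.append("\t".join([project, title, str(desktop), str(mobile)]))
    return rows
```

Calling convert() on the two legacy rows from the example yields the header row followed by a single tab-separated row for de.wikivoyage.org.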
>
> In the future we could also work to /normalise/ the title - replacing
> it with the page title that refers to the actual pageID. This won't
> impact legacy files, and is currently blocked on the Apps team, but
> should be viable as soon as that blocker goes away.
>
> I've written a script capable of parsing and reformatting the legacy
> files, so we should be able to backfill in this new format too, if
> that's wanted (see below).
>
> *The size constraints*
>
> There really aren't any. Like I said, the historical rationale for a
> lot of these decisions seems to have been keeping the files small. But
> by putting requests to the same title from different site versions on
> the same line, and dropping byte-count, we save enough space that the
> resulting files are approximately the same size as the old ones - or
> in many cases, actually smaller.
>
> *What I'm asking for*
>
> Feedback! What do people think of the new format? What would they like
> to see that they don't? What don't they need, here? How useful would
> normalisation be? How useful would backfilling be?
>
> *What I'm not asking for*
> WMF time! Like I said, this is a spare-time project; I've also got
> volunteers for Code Review and checking, too (Yuvi and Otto).
>
> The replacement of the old files! Too many people depend on that
> format and that definition, and I don't want to make them sad.
>
> Thoughts?
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

