I'm curious as to why you are dropping the byte-count. I'm not opposed to it, just wondering whether that data is not valuable.

On Fri, Mar 13, 2015 at 12:06 PM, Oliver Keyes <[email protected]> wrote:
> So, we've got a new pageviews definition; it's nicely integrated and
> spitting out TRUE/FALSE values on each row with the best of em. But
> what does that mean for third-party researchers?
>
> Well...not much, at the moment, because the data isn't being released
> somewhere. But one resource we do have that third-parties use a heck
> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>
> Due to historical size constrains and decision-making (and by
> historical I mean: last decade) these have a number of weirdnesses in
> formatting terms; project identification is done using a notation
> style not really used anywhere else, mobile/zero/desktop appear on
> different lines, and the files are space-separated. I'd like to put
> some volunteer time into spitting out dumps in an easier-to-work-with
> format, using the new definition, to run in /parallel/ with the
> existing logs.
>
> *The new format*
> At the moment we have the format:
>
> project_notation - encoded_title - pageviews - bytes
>
> This puts zero and mobile requests to pageX in a different place to
> desktop requests, requires some reconstruction of project_notation,
> and contains (for some use cases) extraneous information - that being
> the byte-count. The files are also headerless, unquoted and
> space-separated, which saves space but is sometimes...I think the term
> is "eeeeh-inducing".
>
> What I'd like to use as a new format is:
>
> full_project_url - encoded_title - desktop_pageviews -
> mobile_and_zero_pageviews
>
> This file would:
>
> 1. Include a header row;
> 2. Be formatted as a tab-separated, rather than space-separated, file;
> 3. Exclude bytecounts;
> 4. Include desktop and mobile pageview counts on the same line;
> 5. Use the full project URL ("en.wikivoyage.org") instead of the
> pagecounts-specific notation ("en.v")
>
> So, as a made-up example, instead of:
>
> de.m.v Florence 32 9024
> de.v Florence 920 7570
>
> we'd end up with:
>
> de.wikivoyage.org Florence 920 32
>
> In the future we could also work to /normalise/ the title - replacing
> it with the page title that refers to the actual pageID. This won't
> impact legacy files, and is currently blocked on the Apps team, but
> should be viable as soon as that blocker goes away.
>
> I've written a script capable of parsing and reformatting the legacy
> files, so we should be able to backfill in this new format too, if
> that's wanted (see below).
>
> *The size constraints*
>
> There really aren't any. Like I said, the historical rationale for a
> lot of these decisions seems to have been keeping the files small. But
> by putting requests to the same title from different site versions on
> the same line, and dropping byte-count, we save enough space that the
> resulting files are approximately the same size as the old ones - or
> in many cases, actually smaller.
>
> *What I'm asking for*
>
> Feedback! What do people think of the new format? What would they like
> to see that they don't? What don't they need, here? How useful would
> normalisation be? How useful would backfilling be?
>
> *What I'm not asking for*
> WMF time! Like I said, this is a spare-time project; I've also got
> volunteers for Code Review and checking, too (Yuvi and Otto).
>
> The replacement of the old files! Too many people depend on that
> format and that definition, and I don't want to make them sad.
>
> Thoughts?
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
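
For anyone who wants to try the proposed layout against legacy lines, here is a rough Python sketch of the conversion described in the quoted message. It is not Oliver's script, which is not shown in this thread. The names convert, split_project, SUFFIX_TO_DOMAIN and MOBILE_MARKERS are hypothetical, the suffix table contains only the single mapping given in the made-up example ("v" to "wikivoyage.org"), and treating "m" and "zero" as the mobile/zero markers is an assumption based on that example. The column names are taken from the proposal.

```python
import csv
import io
from collections import defaultdict

# Hypothetical and deliberately incomplete: only the mapping used in the
# quoted example ("en.v" -> "en.wikivoyage.org") is included here.
SUFFIX_TO_DOMAIN = {"v": "wikivoyage.org"}

# Assumed markers for mobile and zero traffic, based on "de.m.v" in the example.
MOBILE_MARKERS = {"m", "zero"}

HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def split_project(notation):
    """Return (full_project_url, is_mobile) for a legacy project code."""
    lang, *rest = notation.split(".")
    is_mobile = any(part in MOBILE_MARKERS for part in rest)
    suffixes = [part for part in rest if part not in MOBILE_MARKERS]
    if len(suffixes) != 1 or suffixes[0] not in SUFFIX_TO_DOMAIN:
        raise ValueError(f"unknown project notation: {notation!r}")
    return f"{lang}.{SUFFIX_TO_DOMAIN[suffixes[0]]}", is_mobile


def convert(legacy_lines):
    """Merge legacy space-separated lines into the proposed TSV layout."""
    # (project, title) -> [desktop_views, mobile_and_zero_views]
    counts = defaultdict(lambda: [0, 0])
    for line in legacy_lines:
        if not line.strip():
            continue
        # The fourth field is the byte-count, which the new format drops.
        notation, title, views, _byte_count = line.split()
        project, is_mobile = split_project(notation)
        counts[(project, title)][1 if is_mobile else 0] += int(views)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        writer.writerow([project, title, desktop, mobile])
    return out.getvalue()


if __name__ == "__main__":
    print(convert(["de.m.v Florence 32 9024",
                   "de.v Florence 920 7570"]), end="")
```

Run on the two legacy lines from the made-up example, this merges them into a single de.wikivoyage.org row under a header line. A real run would need the full project-notation table, which I have not tried to reproduce.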
