+1 to having both page_id and (current?) page_title
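
For anyone who wants to try the proposed layout before the new dumps exist, here is a rough sketch of the reformatting described in the quoted proposal below: merge the mobile and desktop rows for a title, drop the byte count, and write a tab-separated file with a header row. This is not Oliver's script. The function names and the `PROJECT_SUFFIXES` table are hypothetical, and the notation handling covers only the made-up "de.m.v" / "de.v" example from the thread, so real legacy codes would need a fuller mapping.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping, taken only from the thread's made-up example
# ("en.v" -> "en.wikivoyage.org"); the real notation has more cases.
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}

HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def split_project(notation):
    """Return (full_project_url, is_mobile) for a legacy project code."""
    parts = notation.split(".")
    # Assumption: an "m" component marks the mobile/zero site, as in "de.m.v".
    is_mobile = "m" in parts[1:]
    language = parts[0]
    suffix = [p for p in parts[1:] if p != "m"][0]
    return f"{language}.{PROJECT_SUFFIXES[suffix]}", is_mobile


def reformat(legacy_lines):
    """Merge legacy rows into one tab-separated row per (project, title)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [desktop, mobile_and_zero]
    for line in legacy_lines:
        # Legacy rows: project_notation encoded_title pageviews bytes
        notation, title, views, _byte_count = line.split()
        project, is_mobile = split_project(notation)
        counts[(project, title)][1 if is_mobile else 0] += int(views)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        writer.writerow([project, title, desktop, mobile])
    return out.getvalue()


if __name__ == "__main__":
    print(reformat(["de.m.v Florence 32 9024", "de.v Florence 920 7570"]), end="")
```

Adding page_id would just be one more column in the header and in each row, once there is a source to look it up from.
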

> On Mar 14, 2015, at 18:58, Oliver Keyes <[email protected]> wrote:
>
> Makes sense :). So, normalised titles at a minimum, and ideally both
> normalised titles and pageID? (the disadvantage of just pageID is, of
> course, having to look the darn thing up. The disadvantage of title is
> that redirects that happen after-the-fact are a thing. Both would
> solve for this).
>
> On 14 March 2015 at 17:35, Roni Wiener <[email protected]> wrote:
>>
>> Sounds great, I believe that normalization of the title will be very useful
>> for future researchers and usages, as will adding the pageId.
>> Currently it is not always straightforward to correlate the wikipedia page
>> with the unnormalized title
>>
>>
>>> On Mar 14, 2015, at 14:00, [email protected] wrote:
>>>
>>> Date: Fri, 13 Mar 2015 15:06:00 -0400
>>> From: Oliver Keyes <[email protected]>
>>> To: "A mailing list for the Analytics Team at WMF and everybody who
>>>     has an interest in Wikipedia and analytics."
>>>     <[email protected]>, Research into Wikimedia content and
>>>     communities <[email protected]>
>>> Subject: [Analytics] [Technical][Request for Comment] A new format for
>>>     the pageview dumps
>>> Message-ID:
>>>     <caauqgdcsvg8htcs4vfdzjal2ruejmk5zea+zxcp3o9own-u...@mail.gmail.com>
>>>
>>> So, we've got a new pageviews definition; it's nicely integrated and
>>> spitting out TRUE/FALSE values on each row with the best of em. But
>>> what does that mean for third-party researchers?
>>>
>>> Well...not much, at the moment, because the data isn't being released
>>> somewhere. But one resource we do have that third-parties use a heck
>>> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>>>
>>> Due to historical size constraints and decision-making (and by
>>> historical I mean: last decade) these have a number of weirdnesses in
>>> formatting terms; project identification is done using a notation
>>> style not really used anywhere else, mobile/zero/desktop appear on
>>> different lines, and the files are space-separated. I'd like to put
>>> some volunteer time into spitting out dumps in an easier-to-work-with
>>> format, using the new definition, to run in /parallel/ with the
>>> existing logs.
>>>
>>> *The new format*
>>>
>>> At the moment we have the format:
>>>
>>> project_notation - encoded_title - pageviews - bytes
>>>
>>> This puts zero and mobile requests to pageX in a different place to
>>> desktop requests, requires some reconstruction of project_notation,
>>> and contains (for some use cases) extraneous information - that being
>>> the byte-count. The files are also headerless, unquoted and
>>> space-separated, which saves space but is sometimes...I think the term
>>> is "eeeeh-inducing".
>>>
>>> What I'd like to use as a new format is:
>>>
>>> full_project_url - encoded_title - desktop_pageviews -
>>> mobile_and_zero_pageviews
>>>
>>> This file would:
>>>
>>> 1. Include a header row;
>>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>>> 3. Exclude bytecounts;
>>> 4. Include desktop and mobile pageview counts on the same line;
>>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>>>    pagecounts-specific notation ("en.v")
>>>
>>> So, as a made-up example, instead of:
>>>
>>> de.m.v Florence 32 9024
>>> de.v Florence 920 7570
>>>
>>> we'd end up with:
>>>
>>> de.wikivoyage.org Florence 920 32
>>>
>>> In the future we could also work to /normalise/ the title - replacing
>>> it with the page title that refers to the actual pageID. This won't
>>> impact legacy files, and is currently blocked on the Apps team, but
>>> should be viable as soon as that blocker goes away.
>>>
>>> I've written a script capable of parsing and reformatting the legacy
>>> files, so we should be able to backfill in this new format too, if
>>> that's wanted (see below).
>>>
>>> *The size constraints*
>>>
>>> There really aren't any. Like I said, the historical rationale for a
>>> lot of these decisions seems to have been keeping the files small. But
>>> by putting requests to the same title from different site versions on
>>> the same line, and dropping byte-count, we save enough space that the
>>> resulting files are approximately the same size as the old ones - or
>>> in many cases, actually smaller.
>>>
>>> *What I'm asking for*
>>>
>>> Feedback! What do people think of the new format? What would they like
>>> to see that they don't? What don't they need, here? How useful would
>>> normalisation be? How useful would backfilling be?
>>>
>>> *What I'm not asking for*
>>>
>>> WMF time! Like I said, this is a spare-time project; I've also got
>>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>>
>>> The replacement of the old files! Too many people depend on that
>>> format and that definition, and I don't want to make them sad.
>>>
>>> Thoughts?
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>> End of Analytics Digest, Vol 37, Issue 33
>>> *****************************************
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
