Yay for +1s! Aaron: I think your commentary (re: future maintainability being an argument for distinct rows) makes sense. Let's operate on that basis - it actually makes the query easier to write ;p.
Kevin: I'm not sure what value there'd be. I mean, there's page-size, maybe? But pageID gives us that (or should).

On 16 March 2015 at 14:20, Andrew Otto <[email protected]> wrote:
> +1 to having both page_id and (current?) page_title
>
>
>> On Mar 14, 2015, at 18:58, Oliver Keyes <[email protected]> wrote:
>>
>> Makes sense :). So, normalised titles at a minimum, and ideally both
>> normalised titles and pageID? (the disadvantage of just pageID is, of
>> course, having to look the darn thing up. The disadvantage of title is
>> that redirects that happen after-the-fact are a thing. Both would
>> solve for this).
>>
>> On 14 March 2015 at 17:35, Roni Wiener <[email protected]> wrote:
>>>
>>> Sounds great, I believe that normalization of the title will be very useful
>>> for future researchers and usages, as would adding the pageId.
>>> Currently it is not always straightforward to correlate the Wikipedia page
>>> with the unnormalized title
>>>
>>>
>>>> On Mar 14, 2015, at 14:00, [email protected] wrote:
>>>>
>>>> Send Analytics mailing list submissions to
>>>> [email protected]
>>>>
>>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>> or, via email, send a message with subject or body 'help' to
>>>> [email protected]
>>>>
>>>> You can reach the person managing the list at
>>>> [email protected]
>>>>
>>>> When replying, please edit your Subject line so it is more specific
>>>> than "Re: Contents of Analytics digest..."
>>>>
>>>>
>>>> Today's Topics:
>>>>
>>>> 1. [Technical][Request for Comment] A new format for the
>>>> pageview dumps (Oliver Keyes)
>>>>
>>>>
>>>> ----------------------------------------------------------------------
>>>>
>>>> Message: 1
>>>> Date: Fri, 13 Mar 2015 15:06:00 -0400
>>>> From: Oliver Keyes <[email protected]>
>>>> To: "A mailing list for the Analytics Team at WMF and everybody who
>>>> has an interest in Wikipedia and analytics."
>>>> <[email protected]>, Research into Wikimedia content and
>>>> communities <[email protected]>
>>>> Subject: [Analytics] [Technical][Request for Comment] A new format for
>>>> the pageview dumps
>>>> Message-ID:
>>>> <caauqgdcsvg8htcs4vfdzjal2ruejmk5zea+zxcp3o9own-u...@mail.gmail.com>
>>>> Content-Type: text/plain; charset=UTF-8
>>>>
>>>> So, we've got a new pageviews definition; it's nicely integrated and
>>>> spitting out TRUE/FALSE values on each row with the best of em. But
>>>> what does that mean for third-party researchers?
>>>>
>>>> Well...not much, at the moment, because the data isn't being released
>>>> somewhere. But one resource we do have that third-parties use a heck
>>>> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>>>>
>>>> Due to historical size constraints and decision-making (and by
>>>> historical I mean: last decade) these have a number of weirdnesses in
>>>> formatting terms; project identification is done using a notation
>>>> style not really used anywhere else, mobile/zero/desktop appear on
>>>> different lines, and the files are space-separated. I'd like to put
>>>> some volunteer time into spitting out dumps in an easier-to-work-with
>>>> format, using the new definition, to run in /parallel/ with the
>>>> existing logs.
>>>>
>>>> *The new format*
>>>> At the moment we have the format:
>>>>
>>>> project_notation - encoded_title - pageviews - bytes
>>>>
>>>> This puts zero and mobile requests to pageX in a different place to
>>>> desktop requests, requires some reconstruction of project_notation,
>>>> and contains (for some use cases) extraneous information - that being
>>>> the byte-count. The files are also headerless, unquoted and
>>>> space-separated, which saves space but is sometimes...I think the term
>>>> is "eeeeh-inducing".
>>>>
>>>> What I'd like to use as a new format is:
>>>>
>>>> full_project_url - encoded_title - desktop_pageviews -
>>>> mobile_and_zero_pageviews
>>>>
>>>> This file would:
>>>>
>>>> 1. Include a header row;
>>>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>>>> 3. Exclude bytecounts;
>>>> 4. Include desktop and mobile pageview counts on the same line;
>>>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>>>> pagecounts-specific notation ("en.v")
>>>>
>>>> So, as a made-up example, instead of:
>>>>
>>>> de.m.v Florence 32 9024
>>>> de.v Florence 920 7570
>>>>
>>>> we'd end up with:
>>>>
>>>> de.wikivoyage.org Florence 920 32
>>>>
>>>> In the future we could also work to /normalise/ the title - replacing
>>>> it with the page title that refers to the actual pageID. This won't
>>>> impact legacy files, and is currently blocked on the Apps team, but
>>>> should be viable as soon as that blocker goes away.
>>>>
>>>> I've written a script capable of parsing and reformatting the legacy
>>>> files, so we should be able to backfill in this new format too, if
>>>> that's wanted (see below).
>>>>
>>>> *The size constraints*
>>>>
>>>> There really aren't any. Like I said, the historical rationale for a
>>>> lot of these decisions seems to have been keeping the files small. But
>>>> by putting requests to the same title from different site versions on
>>>> the same line, and dropping byte-count, we save enough space that the
>>>> resulting files are approximately the same size as the old ones - or
>>>> in many cases, actually smaller.
>>>>
>>>> *What I'm asking for*
>>>>
>>>> Feedback! What do people think of the new format? What would they like
>>>> to see that they don't? What don't they need, here? How useful would
>>>> normalisation be? How useful would backfilling be?
>>>>
>>>> *What I'm not asking for*
>>>> WMF time!
>>>> Like I said, this is a spare-time project; I've also got
>>>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>>>
>>>> The replacement of the old files! Too many people depend on that
>>>> format and that definition, and I don't want to make them sad.
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>> End of Analytics Digest, Vol 37, Issue 33
>>>> *****************************************
>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
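
P.S. For anyone who wants to poke at the conversion described in the original proposal quoted above, here is a minimal Python sketch. It is not the script mentioned in the thread, just an illustration of folding legacy rows into the proposed tab-separated layout with a header row. The names PROJECT_SUFFIXES, MOBILE_MARKERS, parse_notation and convert are hypothetical; the suffix table only covers the "v" (wikivoyage) code that appears in the made-up example, and treating "m" and "zero" segments as mobile/zero markers is an assumption drawn from that same example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical lookup: only "v" -> wikivoyage is taken from the thread's example.
# Other project codes would have to be added before running on real files.
PROJECT_SUFFIXES = {"v": "wikivoyage"}

# Assumption: these notation segments mark mobile or zero traffic.
MOBILE_MARKERS = {"m", "zero"}

HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def parse_notation(notation):
    """Turn e.g. 'de.m.v' into ('de.wikivoyage.org', True)."""
    parts = notation.split(".")
    language, rest = parts[0], parts[1:]
    is_mobile = any(p in MOBILE_MARKERS for p in rest)
    codes = [p for p in rest if p not in MOBILE_MARKERS]
    if len(codes) != 1 or codes[0] not in PROJECT_SUFFIXES:
        raise ValueError(f"unrecognised project notation: {notation!r}")
    return f"{language}.{PROJECT_SUFFIXES[codes[0]]}.org", is_mobile


def convert(legacy_lines):
    """Merge legacy space-separated rows into the proposed TSV text."""
    # (url, title) -> [desktop_pageviews, mobile_and_zero_pageviews]
    counts = defaultdict(lambda: [0, 0])
    for line in legacy_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got: {line!r}")
        notation, title, views, _byte_count = fields  # byte count is dropped
        url, is_mobile = parse_notation(notation)
        counts[(url, title)][1 if is_mobile else 0] += int(views)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for (url, title), (desktop, mobile) in sorted(counts.items()):
        writer.writerow([url, title, desktop, mobile])
    return out.getvalue()


if __name__ == "__main__":
    print(convert(["de.m.v Florence 32 9024", "de.v Florence 920 7570"]), end="")
```

Run on the two made-up legacy rows from the proposal, this emits the header plus a single de.wikivoyage.org row with the desktop and mobile counts side by side. Holding everything in a dictionary keeps the sketch short; a real hourly file might call for a streaming or sorted-merge approach instead.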
