+1 to having both page_id and (current?) page_title

> On Mar 14, 2015, at 18:58, Oliver Keyes <[email protected]> wrote:
> 
> Makes sense :). So, normalised titles at a minimum, and ideally both
> normalised titles and pageID? (the disadvantage of just pageID is, of
> course, having to look the darn thing up. The disadvantage of title is
> that redirects that happen after-the-fact are a thing. Both would
> solve for this).
> 
> On 14 March 2015 at 17:35, Roni Wiener <[email protected]> wrote:
>> 
>> Sounds great. I believe that normalizing the title will be very useful
>> for future researchers and use cases, as will adding the pageId.
>> Currently it is not always straightforward to correlate the Wikipedia page
>> with the unnormalized title.
>> 
>> 
>>> On Mar 14, 2015, at 14:00, [email protected] wrote:
>>> 
>>> Today's Topics:
>>> 
>>>  1. [Technical][Request for Comment] A new format for the
>>>     pageview dumps (Oliver Keyes)
>>> 
>>> 
>>> ----------------------------------------------------------------------
>>> 
>>> Message: 1
>>> Date: Fri, 13 Mar 2015 15:06:00 -0400
>>> From: Oliver Keyes <[email protected]>
>>> To: "A mailing list for the Analytics Team at WMF and everybody who
>>>   has an interest in Wikipedia and analytics."
>>>   <[email protected]>, Research into Wikimedia content and
>>>   communities <[email protected]>
>>> Subject: [Analytics] [Technical][Request for Comment] A new format for
>>>   the pageview dumps
>>> Message-ID:
>>>   <caauqgdcsvg8htcs4vfdzjal2ruejmk5zea+zxcp3o9own-u...@mail.gmail.com>
>>> Content-Type: text/plain; charset=UTF-8
>>> 
>>> So, we've got a new pageviews definition; it's nicely integrated and
>>> spitting out TRUE/FALSE values on each row with the best of em. But
>>> what does that mean for third-party researchers?
>>> 
>>> Well...not much, at the moment, because the data isn't being released
>>> somewhere. But one resource we do have that third-parties use a heck
>>> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>>> 
>>> Due to historical size constraints and decision-making (and by
>>> historical I mean: last decade) these have a number of weirdnesses in
>>> formatting terms; project identification is done using a notation
>>> style not really used anywhere else, mobile/zero/desktop appear on
>>> different lines, and the files are space-separated. I'd like to put
>>> some volunteer time into spitting out dumps in an easier-to-work-with
>>> format, using the new definition, to run in /parallel/ with the
>>> existing logs.
>>> 
>>> *The new format*
>>> At the moment we have the format:
>>> 
>>> project_notation - encoded_title - pageviews - bytes
>>> 
>>> This puts zero and mobile requests to pageX in a different place to
>>> desktop requests, requires some reconstruction of project_notation,
>>> and contains (for some use cases) extraneous information - that being
>>> the byte-count. The files are also headerless, unquoted and
>>> space-separated, which saves space but is sometimes...I think the term
>>> is "eeeeh-inducing".
>>> 
>>> What I'd like to use as a new format is:
>>> 
>>> full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
>>> 
>>> This file would:
>>> 
>>> 1. Include a header row;
>>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>>> 3. Exclude bytecounts;
>>> 4. Include desktop and mobile pageview counts on the same line;
>>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>>> pagecounts-specific notation ("en.v")
>>> 
>>> So, as a made-up example, instead of:
>>> 
>>> de.m.v Florence 32 9024
>>> de.v Florence 920 7570
>>> 
>>> we'd end up with:
>>> 
>>> de.wikivoyage.org Florence 920 32
>>> 
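(Inline, to make the example above concrete: a minimal Python sketch of how the two legacy lines could be folded into one tab-separated row. This is not the script mentioned in this thread. The function names and the PROJECT_SUFFIXES mapping are hypothetical, the mapping covers only the made-up "v" code from the example, and treating a middle "m" or "zero" segment as mobile/zero traffic is an assumption.)

```python
from collections import defaultdict

# Hypothetical mapping, taken only from the made-up example in the message;
# the real legacy notation has more project codes than this.
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}

HEADER = ("full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews")


def parse_legacy_line(line):
    """Split one legacy line into (project_url, title, is_mobile, views)."""
    # Legacy layout: project_notation encoded_title pageviews bytes
    notation, title, views, _bytes = line.split(" ")
    parts = notation.split(".")
    language, code = parts[0], parts[-1]
    # Assumption: a middle segment of "m" or "zero" marks mobile/zero traffic.
    is_mobile = any(p in ("m", "zero") for p in parts[1:-1])
    project_url = f"{language}.{PROJECT_SUFFIXES[code]}"
    return project_url, title, is_mobile, int(views)


def reformat(legacy_lines):
    """Merge desktop and mobile/zero counts per page into tab-separated rows."""
    counts = defaultdict(lambda: [0, 0])  # [desktop, mobile_and_zero]
    for line in legacy_lines:
        line = line.strip()
        if not line:
            continue
        project_url, title, is_mobile, views = parse_legacy_line(line)
        counts[(project_url, title)][1 if is_mobile else 0] += views
    # Header row first, then one row per (project, title), byte counts dropped.
    rows = ["\t".join(HEADER)]
    for (project_url, title), (desktop, mobile) in sorted(counts.items()):
        rows.append(f"{project_url}\t{title}\t{desktop}\t{mobile}")
    return rows
```

Fed the two made-up legacy lines above, reformat() returns the header row followed by a single row holding de.wikivoyage.org, Florence, 920 and 32, separated by tabs.
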
>>> In the future we could also work to /normalise/ the title - replacing
>>> it with the page title that refers to the actual pageID. This won't
>>> impact legacy files, and is currently blocked on the Apps team, but
>>> should be viable as soon as that blocker goes away.
>>> 
>>> I've written a script capable of parsing and reformatting the legacy
>>> files, so we should be able to backfill in this new format too, if
>>> that's wanted (see below).
>>> 
>>> *The size constraints*
>>> 
>>> There really aren't any. Like I said, the historical rationale for a
>>> lot of these decisions seems to have been keeping the files small. But
>>> by putting requests to the same title from different site versions on
>>> the same line, and dropping byte-count, we save enough space that the
>>> resulting files are approximately the same size as the old ones - or
>>> in many cases, actually smaller.
>>> 
>>> *What I'm asking for*
>>> 
>>> Feedback! What do people think of the new format? What would they like
>>> to see that they don't? What don't they need, here? How useful would
>>> normalisation be? How useful would backfilling be?
>>> 
>>> *What I'm not asking for*
>>> WMF time! Like I said, this is a spare-time project; I've also got
>>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>> 
>>> The replacement of the old files! Too many people depend on that
>>> format and that definition, and I don't want to make them sad.
>>> 
>>> Thoughts?
>>> 
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>> 
>>> 
>>> 
>>> ------------------------------
>>> 
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>> 
>>> 
>>> End of Analytics Digest, Vol 37, Issue 33
>>> *****************************************
>> 
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 


