Yay for +1s!

Aaron: I think your commentary (re: future maintainability being an
argument for distinct rows) makes sense. Let's operate on that basis -
it actually makes the query easier to write ;p.

Kevin: I'm not sure what value there'd be. I mean, there's page-size,
maybe? But pageID gives us that (or should).

On 16 March 2015 at 14:20, Andrew Otto <[email protected]> wrote:
> +1 to having both page_id and (current?) page_title
>
>
>> On Mar 14, 2015, at 18:58, Oliver Keyes <[email protected]> wrote:
>>
>> Makes sense :). So, normalised titles at a minimum, and ideally both
>> normalised titles and pageID? (the disadvantage of just pageID is, of
>> course, having to look the darn thing up. The disadvantage of title is
>> that redirects that happen after-the-fact are a thing. Both would
>> solve for this).
>>
>> On 14 March 2015 at 17:35, Roni Wiener <[email protected]> wrote:
>>>
>>> Sounds great. I believe that normalization of the title will be very useful
>>> for future researchers and uses, as will adding the pageID.
>>> Currently it is not always straightforward to correlate the Wikipedia page
>>> with the unnormalized title.
>>>
>>>
>>>> On Mar 14, 2015, at 14:00, [email protected] wrote:
>>>>
>>>> Today's Topics:
>>>>
>>>>  1. [Technical][Request for Comment] A new format for the
>>>>     pageview dumps (Oliver Keyes)
>>>>
>>>>
>>>> ----------------------------------------------------------------------
>>>>
>>>> Message: 1
>>>> Date: Fri, 13 Mar 2015 15:06:00 -0400
>>>> From: Oliver Keyes <[email protected]>
>>>> To: "A mailing list for the Analytics Team at WMF and everybody who
>>>>   has an interest in Wikipedia and analytics."
>>>>   <[email protected]>, Research into Wikimedia content and
>>>>   communities <[email protected]>
>>>> Subject: [Analytics] [Technical][Request for Comment] A new format for
>>>>   the pageview dumps
>>>> Message-ID:
>>>>   <caauqgdcsvg8htcs4vfdzjal2ruejmk5zea+zxcp3o9own-u...@mail.gmail.com>
>>>> Content-Type: text/plain; charset=UTF-8
>>>>
>>>> So, we've got a new pageviews definition; it's nicely integrated and
>>>> spitting out TRUE/FALSE values on each row with the best of 'em. But
>>>> what does that mean for third-party researchers?
>>>>
>>>> Well...not much, at the moment, because the data isn't being released
>>>> anywhere. But one resource we do have that third-parties use a heck
>>>> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>>>>
>>>> Due to historical size constraints and decision-making (and by
>>>> historical I mean: last decade) these have a number of weirdnesses in
>>>> formatting terms; project identification is done using a notation
>>>> style not really used anywhere else, mobile/zero/desktop appear on
>>>> different lines, and the files are space-separated. I'd like to put
>>>> some volunteer time into spitting out dumps in an easier-to-work-with
>>>> format, using the new definition, to run in /parallel/ with the
>>>> existing logs.
>>>>
>>>> *The new format*
>>>> At the moment we have the format:
>>>>
>>>> project_notation - encoded_title - pageviews - bytes
>>>>
>>>> This puts zero and mobile requests to pageX in a different place to
>>>> desktop requests, requires some reconstruction of project_notation,
>>>> and contains (for some use cases) extraneous information - that being
>>>> the byte-count. The files are also headerless, unquoted and
>>>> space-separated, which saves space but is sometimes...I think the term
>>>> is "eeeeh-inducing".
>>>>
>>>> What I'd like to use as a new format is:
>>>>
>>>> full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
>>>>
>>>> This file would:
>>>>
>>>> 1. Include a header row;
>>>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>>>> 3. Exclude bytecounts;
>>>> 4. Include desktop and mobile pageview counts on the same line;
>>>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>>>> pagecounts-specific notation ("en.v")
>>>>
>>>> So, as a made-up example, instead of:
>>>>
>>>> de.m.v Florence 32 9024
>>>> de.v Florence 920 7570
>>>>
>>>> we'd end up with:
>>>>
>>>> de.wikivoyage.org Florence 920 32
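
For illustration only, the merge described above could be sketched roughly as
follows. This is not the script mentioned in the thread. The names
PROJECT_SUFFIXES, MOBILE_MARKERS, parse_project and reformat_legacy are
hypothetical, the suffix table holds only the single pairing used in the
made-up example, and treating "m" and "zero" as the mobile/zero markers is an
assumption.

```python
from collections import defaultdict

# Assumption: only the pairing from the made-up example in this thread.
# A real table would have to come from the pagecounts documentation.
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}

# Assumption: these notation components mark mobile and zero traffic.
MOBILE_MARKERS = {"m", "zero"}

HEADER = "\t".join([
    "full_project_url",
    "encoded_title",
    "desktop_pageviews",
    "mobile_and_zero_pageviews",
])


def parse_project(notation):
    """Turn legacy notation such as 'de.m.v' into (project_url, is_mobile)."""
    language, *rest = notation.split(".")
    is_mobile = any(part in MOBILE_MARKERS for part in rest)
    suffixes = [part for part in rest if part not in MOBILE_MARKERS]
    if len(suffixes) != 1 or suffixes[0] not in PROJECT_SUFFIXES:
        raise ValueError(f"unrecognised project notation: {notation!r}")
    return f"{language}.{PROJECT_SUFFIXES[suffixes[0]]}", is_mobile


def reformat_legacy(lines):
    """Merge legacy rows into one tab-separated row per (project, title)."""
    # (project_url, title) -> [desktop_views, mobile_and_zero_views]
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        # Legacy rows: project_notation, encoded_title, pageviews, bytes.
        notation, title, views, _byte_count = line.split()
        project, is_mobile = parse_project(notation)
        counts[(project, title)][1 if is_mobile else 0] += int(views)

    rows = [HEADER]
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        rows.append(f"{project}\t{title}\t{desktop}\t{mobile}")
    return rows


if __name__ == "__main__":
    legacy = ["de.m.v Florence 32 9024", "de.v Florence 920 7570"]
    print("\n".join(reformat_legacy(legacy)))
```

The byte-count is read and then discarded, and a title seen only on mobile
still gets a row, with a desktop count of 0.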
>>>>
>>>> In the future we could also work to /normalise/ the title - replacing
>>>> it with the page title that refers to the actual pageID. This won't
>>>> impact legacy files, and is currently blocked on the Apps team, but
>>>> should be viable as soon as that blocker goes away.
>>>>
>>>> I've written a script capable of parsing and reformatting the legacy
>>>> files, so we should be able to backfill in this new format too, if
>>>> that's wanted (see below).
>>>>
>>>> *The size constraints*
>>>>
>>>> There really aren't any. Like I said, the historical rationale for a
>>>> lot of these decisions seems to have been keeping the files small. But
>>>> by putting requests to the same title from different site versions on
>>>> the same line, and dropping byte-count, we save enough space that the
>>>> resulting files are approximately the same size as the old ones - or
>>>> in many cases, actually smaller.
>>>>
>>>> *What I'm asking for*
>>>>
>>>> Feedback! What do people think of the new format? What would they like
>>>> to see that they don't? What don't they need, here? How useful would
>>>> normalisation be? How useful would backfilling be?
>>>>
>>>> *What I'm not asking for*
>>>> WMF time! Like I said, this is a spare-time project; I've also got
>>>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>>>
>>>> The replacement of the old files! Too many people depend on that
>>>> format and that definition, and I don't want to make them sad.
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>> End of Analytics Digest, Vol 37, Issue 33
>>>> *****************************************
>>>
>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation

