Re: [Analytics] A new format for the pageview dumps

Timo Tijhof Wed, 25 Mar 2015 00:48:33 -0700

Hm.. interesting. A few random ideas:

I suppose extra columns wouldn't have to break existing scripts.

For total counts, keeping all columns on one line makes it easier to add 
columns later, provided consumers' logic does "sum any columns after title". 
Adding a column mostly means fragmenting traffic into more buckets, so you'd 
want to add up whatever columns exist at all times. That way their total count 
stays accurate.
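
A minimal sketch of the "sum any columns after title" idea, in Python. It 
assumes the tab-separated layout with a header row that Oliver proposes below. 
The function name total_pageviews, the app_pageviews column and the sample 
rows are made up for illustration.

```python
import csv
import io


def total_pageviews(tsv_text):
    """Return {(project, title): total}, summing every column after the title."""
    # QUOTE_NONE so that titles containing a double quote are read literally.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    totals = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        project, title, *counts = row
        key = (project, title)
        totals[key] = totals.get(key, 0) + sum(int(c) for c in counts if c)
    return totals


# Made-up samples: an older dump with two count columns, a newer one with three.
OLD_DUMP = (
    "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews\n"
    "de.wikivoyage.org\tFlorence\t920\t32\n"
)
NEW_DUMP = (
    "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews\tapp_pageviews\n"
    "de.wikivoyage.org\tFlorence\t920\t30\t2\n"
)

old_totals = total_pageviews(OLD_DUMP)
new_totals = total_pageviews(NEW_DUMP)
```

Because the function never names the count columns, the same code gives the 
same total for both files even though the newer one splits traffic into one 
more bucket.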

When querying for a specific type of traffic in the future (e.g. only traffic 
from *.applewatch.*.org ::troll::) one would have to be careful to account for 
the column not existing in older dumps.
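
A rough sketch of what "being careful" could look like, again assuming a 
tab-separated file with a header row. The header names are borrowed from the 
proposals in this thread (pv_app is hypothetical), and pageviews_for is a 
made-up helper name. It returns None when the dump predates the column, so a 
missing column is not mistaken for zero views.

```python
import csv
import io


def pageviews_for(tsv_text, column):
    """Return {(project, title): count}, or None if the dump lacks the column."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    if column not in (reader.fieldnames or []):
        return None  # column not recorded in this dump; not the same as zero
    return {
        (row["full_project_url"], row["encoded_title"]): int(row[column])
        for row in reader
    }


# Made-up samples: the older dump has no pv_app column, the newer one does.
OLDER = (
    "full_project_url\tencoded_title\tpv_desktop\tpv_mobile\n"
    "de.wikivoyage.org\tFlorence\t920\t32\n"
)
NEWER = (
    "full_project_url\tencoded_title\tpv_desktop\tpv_mobile\tpv_app\n"
    "de.wikivoyage.org\tFlorence\t920\t30\t2\n"
)

app_older = pageviews_for(OLDER, "pv_app")
app_newer = pageviews_for(NEWER, "pv_app")
```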

Perhaps:
* add a column for total count per title (adding up N+2 columns feels like 
something query tools generally would not support).
* give each type of traffic its own column (pv_desktop, pv_mobile, pv_zero, 
pv_app).

Or perhaps:
* go back to separate lines for each traffic type. That either requires users 
to be aware of the different traffic types and their url permutations, or we 
could use a normalised url and a dedicated column for traffic type.

With a dedicated column, users don't have to know the url or special identifier 
permutations and can simply ignore that column and add up the lines if they 
want total counts. Very query-friendly.

canonical_project_hostname - traffic_source - encoded_title - pageviews_count
de.wikivoyage.org, desktop, "Florence", 920
de.wikivoyage.org, mobile, "Florence", 32
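
To illustrate the query-friendliness, here is a small sketch that reads lines 
in the shape of the two example rows above (comma-separated, quoted title). The 
function name aggregate and the exact delimiter are assumptions for the sake of 
the example, not a proposed file format.

```python
import csv
import io
from collections import defaultdict

# The two made-up example rows, one line per traffic source.
SAMPLE = '''de.wikivoyage.org, desktop, "Florence", 920
de.wikivoyage.org, mobile, "Florence", 32
'''


def aggregate(text, source=None):
    """Sum pageviews per (project, title); optionally keep one traffic source."""
    totals = defaultdict(int)
    # skipinitialspace drops the space after each comma before parsing a field.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for project, traffic_source, title, count in reader:
        if source is None or traffic_source == source:
            totals[(project, title)] += int(count)
    return dict(totals)


all_traffic = aggregate(SAMPLE)            # ignores the traffic_source column
mobile_only = aggregate(SAMPLE, "mobile")  # filters on it instead
```

A new traffic type would just be a new value in the traffic_source column, so 
neither call needs to change when one is added.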

/me goes back to lurking from the bushes,

— Timo

On 15 Mar 2015, at 15:44, Aaron Halfaker <[email protected]> wrote:

> It seems that this schema will need to be modified to include new types of 
> view counts.  For example, it doesn't seem like views via the Wikipedia App 
> are included (or maybe they are within "mobile").  If we wanted to have them 
> separate, we'd need to add another column -- changing the data format/schema. 
>  For any additional sources of views that we'd need to separate in the 
> future, we'd need to add another column again.
> 
> So, in a way, this format is less 'normalized' than the previous which would 
> have put the counts from different sources on different lines.  This will 
> require a data consumer who wants to process historic files to be able to 
> handle files with differing numbers of columns and think of the incoming data 
> as "full_project_url - title - desktop_counts - mobile_counts - ..." where 
> "..." could contain <something> or <nothing> but must be handled regardless.
> 
> I think that this is undesirable -- but not *that* undesirable.  
> 
> -Aaron
> 

On Mar 14, 2015, at 14:00, Oliver Keyes wrote:

> [..]
> 
> What I'd like to use as a new format is:
> 
> full_project_url - encoded_title - desktop_pageviews - 
> mobile_and_zero_pageviews
> 
> This file would:
> 
> 1. Include a header row;
> 2. Be formatted as a tab-separated, rather than space-separated, file;
> 3. Exclude bytecounts;
> 4. Include desktop and mobile pageview counts on the same line;
> 5. Use the full project URL ("en.wikivoyage.org") instead of the
> pagecounts-specific notation ("en.v")
> 
> So, as a made-up example, instead of:
> 
> de.m.v Florence 32 9024
> de.v Florence 920 7570
> 
> we'd end up with:
> 
> de.wikivoyage.org Florence 920 32
>

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
