Hi Dylan, 

 

The pagecounts-ez format isn't new; it has been around for years. More importantly, 
the merge happens only once each day, once all hourly files for that day are 
available (and the daily files are later merged into a separate monthly file). 
So its benefits are: one file instead of 24 or 31*24, with all hourly data 
preserved (in sparse arrays), and much less space (each title occurs once 
instead of up to 720 times). It's not intended for sites that need hourly 
updates; it's for archiving and longer-term trend analysis.
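
To give an idea of what that buys you, here is a minimal sketch (in Python) of 
unpacking the sparse hourly counts, assuming the letter-per-hour encoding 
(A = hour 0 through X = hour 23) described on the pagecounts-ez documentation 
page; the sample line below is made up:

# Sketch: unpack sparse hourly counts from a pagecounts-ez style line.
# Assumes the letter-per-hour encoding (A = hour 0 ... X = hour 23);
# the sample line is hypothetical.
import re

def unpack_hours(encoded):
    """Turn e.g. 'A3B5X2' into {0: 3, 1: 5, 23: 2}."""
    return {ord(letter) - ord('A'): int(count)
            for letter, count in re.findall(r'([A-X])(\d+)', encoded)}

project, title, total, hourly = "en.z Main_Page 42 A3B5X2".split(' ', 3)
print(project, title, total, unpack_hours(hourly))
# -> en.z Main_Page 42 {0: 3, 1: 5, 23: 2}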

 

Fortunately, your migration task is much easier: just redirect your download 
script to https://dumps.wikimedia.org/other/pageviews/

 

You'll find the same hourly files you have used up until now, in a 
backwards-compatible scheme. 

 

The main differences are: counts in the new hourly files are cleaner (bot 
requests are filtered out), and for each wiki, requests from mobile devices are 
now included (desktop and mobile counts appear on separate lines).
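
As a rough example, an hourly download against the new location could look like 
the sketch below; the year/month subdirectory layout and the 
pageviews-YYYYMMDD-HH0000.gz filename pattern are what you will see when 
browsing the directory, but please verify them there rather than treating this 
as a spec:

# Rough sketch of fetching one hourly file from the new location.
# The directory layout and filename pattern are assumptions taken from
# browsing https://dumps.wikimedia.org/other/pageviews/ -- double-check them.
from datetime import datetime, timedelta
from urllib.request import urlretrieve

BASE = "https://dumps.wikimedia.org/other/pageviews"

def fetch_hour(ts):
    """Download the hourly pageviews file for the hour starting at ts (UTC)."""
    name = ts.strftime("pageviews-%Y%m%d-%H0000.gz")
    url = f"{BASE}/{ts:%Y}/{ts:%Y-%m}/{name}"
    urlretrieve(url, name)   # saves the file next to the script
    return name

# Files appear with some delay, so e.g. fetch the hour from two hours ago:
print(fetch_hour(datetime.utcnow() - timedelta(hours=2)))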

 

I hope this helps,

 

Cheers,

Erik Zachte

 

 

From: Analytics [mailto:[email protected]] On Behalf Of 
Jaime Crespo
Sent: Wednesday, August 17, 2016 16:02
To: A mailing list for the Analytics Team at WMF and everybody who has an 
interest in Wikipedia and analytics.
Subject: Re: [Analytics] Urgent Data Issue

 

> we will be taking up more Wikimedia bandwidth

Please note that, from the operations side of things (disclaimer: I am *not* a 
netops person), my understanding is that pure bandwidth usage is currently a 
non-issue (it is mostly a fixed cost rather than a variable one). Repeatedly 
hitting a server is far more "costly" (all things considered, including server 
purchase and maintenance) than a one-time dump download. All dump users: use as 
much as you need (without wasting it) to meet your goals, and do not worry too 
much about bandwidth.

> Also want to say that we're very thankful for the work you all are doing 
> publishing this dataset, it's enormously useful for entity popularity in our 
> search engine for publishers <https://graphiq.com/search> .

My personal opinion is that, indeed, Analytics' work is very important for our 
mission (spreading free knowledge) and they are doing a great job. I do not know 
if that is said enough.

 

On Tue, Aug 16, 2016 at 11:06 PM, Dylan Wenzlau <[email protected]> wrote:

Thank you for the update. No one from our team is on the mailing list, and we 
had not viewed the /other/analytics page before (only the pagecounts-all-sites 
page <https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites>  
and pages linked from there), which explains why we didn't know about this. I 
do see you recently added a link to the Phabricator issue though, which is helpful!

 

I am currently rewriting our scripts to use the new pagecounts-ez format, 
although I think this new format means that we will be taking up more 
Wikimedia bandwidth than we did previously, since we will have to re-download 
the merged daily file once per hour in order to use the hourly stats. 
Previously, we only had to download ~100MB per hour, and now it seems we'll be 
downloading ~350MB per hour. Please correct me if I'm missing something obvious 
here!
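
(Back-of-the-envelope, assuming we keep polling every hour: ~100 MB x 24 ≈ 2.4 
GB/day before, versus ~350 MB x 24 ≈ 8.4 GB/day if we re-fetch the merged daily 
file each hour, so roughly 3.5x the transfer.)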

 

Also want to say that we're very thankful for the work you all are doing 
publishing this dataset, it's enormously useful for entity popularity in our 
search engine for publishers <https://graphiq.com/search> .

 

On Tue, Aug 16, 2016 at 1:48 PM, Dan Andreescu <[email protected]> wrote:

Dylan, there's also been a deprecation message on the page that links to these 
datasets, since last winter: https://dumps.wikimedia.org/other/analytics/

 

If you know of other places where these datasets are referenced, I'd be happy to 
update the docs and add links to the email threads.  We usually publish 
information about this kind of deprecation on this list well in advance, but we 
are open to reaching out in other ways.

 

On Tue, Aug 16, 2016 at 4:13 PM, Nuria Ruiz <[email protected]> wrote:

 

Dylan, 

 

(cc-ing analytics@ public list)

 

Please see announcement about deprecation of datasets: 

 https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html

 

 

Thanks,

 

Nuria

 

On Tue, Aug 16, 2016 at 12:53 PM, Dylan Wenzlau <[email protected]> wrote:

It seems the pagecounts-all-sites dumps have completely stopped updating, and I 
don't see any warning or message about why this is the case or whether it's 
currently being resolved. Our company relies pretty heavily on this data, as I 
imagine other projects & companies do as well, so I think it would be useful to 
at least display a big warning message on the documentation pages explaining 
why these are no longer updating.

 

Thanks,


 

-- 

Dylan Wenzlau  |  Director of Engineering

 

 


-- 

Dylan Wenzlau  |  Director of Engineering


-- 

Jaime Crespo

<http://wikimedia.org>

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
