Hi all!

Some of you are probably aware of the pagecounts-raw dataset hosted at 
http://dumps.wikimedia.org/other/pagecounts-raw/.  This week, we are making a 
change to how this dataset is generated.  This should be mostly transparent, 
but an announcement is needed just in case anyone notices any differences.

pagecounts-raw has historically been generated by piping the udp2log webrequest 
logs into a C program called webstatscollector[1].  This code is fairly old, 
and the logic it uses to generate pagecounts is out of date.  However, since 
this data has been public for so long, we made an effort to continue to support 
it as is.

We are still in the process of backfilling, but eventually all pagecounts-raw 
data after January 1, 2015 will be generated from webrequest data stored in 
HDFS.  This data is collected using Kafka, and pagecounts-raw is now generated 
by Hive.

You may see a slight increase in article counts, because the webrequest data 
in HDFS is less lossy than the udp2log data.
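
If you want to check whether the change affects your own numbers, here is a 
rough Python sketch that totals requests per project in an hourly file, so you 
can compare an hour generated the old way with one generated the new way.  It 
assumes the usual pagecounts line layout of four space-separated fields 
(project code, page title, request count, bytes transferred) and gzip-compressed 
hourly files.  The function names are made up for illustration and are not part 
of any Wikimedia tooling.

```python
import gzip
from collections import Counter
from typing import Iterable


def total_requests_by_project(lines: Iterable[str]) -> Counter:
    """Sum the request-count column per project code.

    Assumes each line looks like: "<project> <page_title> <count> <bytes>".
    Lines that do not match that shape are skipped.
    """
    totals: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) != 4:
            continue  # not the assumed four-field layout
        project, _title, count, _bytes = fields
        try:
            totals[project] += int(count)
        except ValueError:
            continue  # count column was not an integer
    return totals


def totals_from_gzip(path: str) -> Counter:
    """Read one gzip-compressed hourly file and total it per project."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return total_requests_by_project(handle)


def relative_change(old: Counter, new: Counter, project: str) -> float:
    """Fractional change in total requests for one project (0.02 means +2%)."""
    if old[project] == 0:
        raise ValueError(f"no requests recorded for {project!r} in old data")
    return (new[project] - old[project]) / old[project]
```

Calling totals_from_gzip() on two locally downloaded hourly files and passing 
the results to relative_change() gives a quick per-project comparison.  Hourly 
traffic varies on its own, so a difference between two arbitrary hours says 
little by itself; this is only a starting point.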

By the way, do you know about the pagecounts-all-sites[2] dataset?  
pagecounts-all-sites is in a similar format to pagecounts-raw, but comes with 
more up-to-date pagecount logic.  Most importantly, it includes mobile site 
pagecounts. Perhaps you should use pagecounts-all-sites instead of 
pagecounts-raw, eh? :)

-Andrew Otto

[1] https://github.com/wikimedia/analytics-webstatscollector
[2] http://dumps.wikimedia.org/other/pagecounts-all-sites/


_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
