Hi George,

I don't really know about historical numbers :( I'm forwarding your message to the Analytics mailing list to get some more help :)

Cheers
Joseph
---------- Forwarded message ----------
From: George Gkotsis <[email protected]>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: [email protected], [email protected], [email protected], [email protected]

Greetings Wikimedia Analytics team!

First, thanks for your amazing work! Your work has an amazing impact on everyone, including researchers like me.

My name is George Gkotsis and I am a post-doctoral research fellow at King's College London. I have recently finished downloading the massive weblog files dataset and I am trying to "tame" the beast. As part of this process, I am reading all .gz files that concern Wikimedia page visits (downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).

Unless I am mistaken, I have found cases of either missing or corrupt archives. I paste a few randomly sampled examples below:

*Missing:*
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-20100705-09**
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081021-23**
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-20090925-23**

*Corrupted:*
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz

(the list is quite long and I haven't finished processing it, but I can give you a full log file)

Could you provide some feedback concerning the above cases?

Best regards,
George

--
/g

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
