Hi George,
I don't really know about historical numbers :(
I'm forwarding your message to the Analytics mailing list to get some more help
:)
Cheers
Joseph

---------- Forwarded message ----------
From: George Gkotsis <[email protected]>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: [email protected], [email protected], [email protected],
[email protected]


Greetings Wikimedia Analytics team!

First, thanks for your amazing work! It has a real impact on
everyone, including researchers like me.

My name is George Gkotsis and I am a post-doctoral research fellow at
King's College London. I have recently finished downloading the massive
weblog files dataset and I am trying to "tame" the beast. As part of this
process, I am reading all .gz files that concern Wikimedia page visits
(downloaded from http://dumps.wikimedia.org/other/pagecounts-raw/*).

Unless I am mistaken, I have found cases of either missing or corrupt
archives. I have pasted a few randomly sampled examples below:

*Missing:*
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-20100705-09**
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081021-23**
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-20090925-23**
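
For reference, gaps like these could be found with something along the
lines of the sketch below. It assumes every file name contains a
pagecounts-YYYYMMDD-HH prefix, as in the examples above. The function
name missing_hours and its arguments are hypothetical, not part of any
Wikimedia tooling.

```python
import re
from datetime import datetime, timedelta


def missing_hours(filenames, start, end):
    """Return the hourly prefixes in [start, end) that have no matching file.

    Assumption: each file name contains 'pagecounts-YYYYMMDD-HH'.
    """
    pattern = re.compile(r"pagecounts-(\d{8}-\d{2})")
    present = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            present.add(match.group(1))

    missing = []
    current = start
    while current < end:
        key = current.strftime("%Y%m%d-%H")
        if key not in present:
            missing.append("pagecounts-" + key)
        current += timedelta(hours=1)
    return missing
```

Calling it with a directory listing plus the first and last hour of the
download range returns one prefix per hour that has no archive at all.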

*Corrupted:*
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can
give you a full log file)
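
For reference, a minimal sketch of one way to flag an archive as
corrupted: decompress it in full and treat any decompression error as
corruption. This is an illustration, not necessarily the exact check I
ran. The name is_corrupt and the chunk size are hypothetical choices.

```python
import gzip
import zlib


def is_corrupt(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` cannot be fully decompressed."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in chunks so large archives never sit in memory.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError covers
        # truncated files, zlib.error covers invalid compressed data.
        return True
    return False
```

Reading the whole stream matters here: a truncated or damaged file can
still have a valid header, so the error only shows up partway through
or at the end.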

Could you provide some feedback concerning the above cases?

Best regards,
George

-- 
/g



-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
