Hi George,
Server mishaps often had to do with congestion: in the early years, traffic
overload was made worse by non-essential routines running in parallel on the
same server.
I can't comment on the precise reason, on each occasion, why page view count
files went missing or got corrupted. We haven't kept a journal for that.
Here are the dates I know of in the last 5+ years with corrupt or incomplete
counts that could not be repaired. BTW I correct for these by extrapolating
from the remaining files for that month:
# bad measurements on these dates
next if $file ge "projectcounts-20100611-000000" and
        $file lt "projectcounts-20100617-000000" ;
next if $file ge "projectcounts-20100627-000000" and
        $file lt "projectcounts-20100628-000000" ;
next if $file ge "projectcounts-20110908-000000" and
        $file lt "projectcounts-20110915-000000" ;
next if $file ge "projectcounts-20111223-010000" and
        $file lt "projectcounts-20111226-160000" ;
next if $file ge "projectcounts-20120413-000000" and
        $file lt "projectcounts-20120417-000000" ;
next if $file ge "projectcounts-20121214-000000" and
        $file lt "projectcounts-20130108-000000" ;
next if $file ge "projectcounts-20130723-000000" and
        $file lt "projectcounts-20130724-000000" ;
next if $file ge "projectcounts-20140105-000000" and
        $file lt "projectcounts-20140107-000000" ;
next if $file ge "projectcounts-20140827-000000" and
        $file lt "projectcounts-20140828-000000" ;
next if $file ge "projectcounts-20150803-180000" and
        $file lt "projectcounts-20150803-230000" ;
next if $file ge "projectcounts-20150810-150000" and
        $file lt "projectcounts-20150810-210000" ;
next if $file ge "projectcounts-20150811-170000" and
        $file lt "projectcounts-20150811-180000" ;
Two or three larger periods of massive undercounting are not listed here, as
these could mostly be repaired at the per-wiki aggregation level. [1]
One ran for 7 months, and at its peak we lost 1/3 of messages; see
http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-news/
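To make the skip list and the extrapolation mentioned above concrete, here is a
minimal sketch. It is in Python for illustration (the actual filter shown above
is Perl). The function names `is_bad` and `extrapolate_month`, and the choice to
fill skipped hours with the mean of the usable hours, are assumptions and not
necessarily what the production script does.

```python
# Half-open ranges [start, end) of file names with bad measurements.
# Only two of the ranges listed above are repeated here, for brevity.
BAD_RANGES = [
    ("projectcounts-20100611-000000", "projectcounts-20100617-000000"),
    ("projectcounts-20100627-000000", "projectcounts-20100628-000000"),
]


def is_bad(file_name):
    """Return True if file_name falls inside one of the known bad ranges."""
    # Plain string comparison works because the timestamps are zero-padded,
    # which mirrors the `ge` / `lt` tests in the Perl filter.
    return any(start <= file_name < end for start, end in BAD_RANGES)


def extrapolate_month(hourly_counts, hours_in_month):
    """Estimate a monthly total from the usable hourly files only.

    hourly_counts maps file name -> view count for the files that exist.
    Assumption: skipped or missing hours get the mean of the usable hours.
    """
    good = [count for name, count in hourly_counts.items() if not is_bad(name)]
    if not good:
        raise ValueError("no usable files for this month")
    return sum(good) * hours_in_month / len(good)
```

Note that the end of each range is exclusive, so a file named exactly like the
upper bound is kept.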
I hope this helps,
Cheers, Erik
[1] By deducing the hourly loss rate per server from the average gap between
sequence numbers (which should be 1000 on average with the sampled log).
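As a rough illustration of footnote [1], the sketch below (Python, with
hypothetical function names) estimates the loss rate from the average
sequence-number gap. It assumes messages are lost before sampling, so that an
average gap wider than 1000 indicates lost messages, and that the counter
neither wraps nor restarts within the hour. The real repair may well have
differed in detail.

```python
EXPECTED_GAP = 1000  # 1:1000 sampled log, as stated in the footnote


def loss_rate(sequence_numbers, expected_gap=EXPECTED_GAP):
    """Estimate the fraction of lost messages for one server and one hour.

    sequence_numbers: that server's sequence numbers as they appear in the
    sampled log, in arrival order.
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sequence numbers")
    # Average gap = total span divided by the number of gaps.
    span = sequence_numbers[-1] - sequence_numbers[0]
    average_gap = span / (len(sequence_numbers) - 1)
    if average_gap <= 0:
        raise ValueError("sequence numbers must be increasing overall")
    # An average gap at or below the expected value is treated as no loss.
    return max(0.0, 1.0 - expected_gap / average_gap)


def corrected_count(observed_count, rate):
    """Scale an observed count up to compensate for the estimated loss."""
    return observed_count / (1.0 - rate)
```

With an average gap of 1500 instead of 1000, this gives a loss rate of 1/3, and
an observed count would be scaled up by a factor of 1.5.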
From: [email protected]
[mailto:[email protected]] On Behalf Of Joseph Allemandou
Sent: Monday, September 14, 2015 15:05
To: George Gkotsis; A mailing list for the Analytics Team at WMF and everybody
who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] corrupted and missing log files
Hi George,
I don't really know about historical numbers :(
I'm forwarding your message to the Analytics mailing list to get some more help :)
Cheers
Joseph
---------- Forwarded message ----------
From: George Gkotsis <[email protected]>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: [email protected], [email protected], [email protected],
[email protected]
Greetings Wikimedia Analytics team!
First, thanks for your amazing work! It has an amazing impact on everyone,
including researchers like me.
My name is George Gkotsis and I am a post-doctoral research fellow at King's
College London. I have recently finished downloading the massive weblog files
dataset and I am trying to "tame" the beast. As part of this process, I am
reading all .gz files that concern Wikimedia page visits (downloaded from
http://dumps.wikimedia.org/other/pagecounts-raw/*).
Unless I am mistaken, I have found cases of either missing or corrupt archives.
I paste a few examples I randomly sampled below:
Missing:
http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-20100705-09**
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081021-23**
http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-20090925-23**
Corrupted:
pagecounts-20080304-030000.gz
pagecounts-20080304-140000.gz
pagecounts-20080304-150000.gz
pagecounts-20090921-160000.gz
(the list is quite long and I haven't finished processing it, but I can give
you a full log file)
Could you provide some feedback concerning the above cases?
Best regards,
George
--
/g
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics