Hi George,

 

Server mishaps often had to do with congestion: in the early years, traffic 
overload was made worse by non-essential routines running in parallel on the 
same server.

I can't comment on the precise reason, per occasion, why page view count files 
went missing or got corrupted. We haven't kept a journal for that.

 

Here are the dates I know of in the last 5+ years with corrupt or incomplete 
counts that could not be repaired. BTW, I correct for these by extrapolating 
from the remaining files for that month:

 

      next if $file ge "projectcounts-20100611-000000" and $file lt "projectcounts-20100617-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20100627-000000" and $file lt "projectcounts-20100628-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20110908-000000" and $file lt "projectcounts-20110915-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20111223-010000" and $file lt "projectcounts-20111226-160000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20120413-000000" and $file lt "projectcounts-20120417-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20121214-000000" and $file lt "projectcounts-20130108-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20130723-000000" and $file lt "projectcounts-20130724-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20140105-000000" and $file lt "projectcounts-20140107-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20140827-000000" and $file lt "projectcounts-20140828-000000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20150803-180000" and $file lt "projectcounts-20150803-230000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20150810-150000" and $file lt "projectcounts-20150810-210000" ; # bad measurements on these dates
      next if $file ge "projectcounts-20150811-170000" and $file lt "projectcounts-20150811-180000" ; # bad measurements on these dates
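
For readers who want to reproduce the idea, here is a minimal sketch in Python (not the Perl used above) of skipping the excluded ranges and extrapolating a monthly total from the remaining hourly files. The proportional scaling, the function names, and the input format (a mapping from file name to that hour's total, holding only files of the month in question) are assumptions for illustration; the exact extrapolation used in the real scripts is not described here.

```python
import calendar

# Two of the half-open [start, end) ranges from the list above. Plain string
# comparison works because the timestamp in the file name is zero-padded.
BAD_RANGES = [
    ("projectcounts-20100611-000000", "projectcounts-20100617-000000"),
    ("projectcounts-20100627-000000", "projectcounts-20100628-000000"),
]

def is_bad(name):
    """True if the file name falls inside any excluded range."""
    return any(start <= name < end for start, end in BAD_RANGES)

def extrapolated_month_total(year, month, hourly_counts):
    """Scale the sum of the usable hours up to a full month.

    hourly_counts maps file name -> total views in that hourly file
    (hypothetical input; it must contain only files for this month).
    """
    good = [n for name, n in hourly_counts.items() if not is_bad(name)]
    if not good:
        raise ValueError("no usable files for this month")
    hours_in_month = calendar.monthrange(year, month)[1] * 24
    # Assumption: the usable hours are representative of the whole month.
    return sum(good) * hours_in_month / len(good)
```

This treats every usable hour as equally representative, so it ignores daily and weekly traffic cycles; a more careful version would weight by hour of day or day of week.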

 

Two or three larger periods of massive undercounting are not listed here, as 
these could mostly be repaired at the per-wiki aggregation level. [1]

One ran for 7 months, and at its peak we lost 1/3 of messages; see 
http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-news/ 

 

I hope this helps,

 

Cheers, Erik

 

[1] By deducing the hourly loss rate per server from the average gap between 
sequence numbers (which should be 1000 on average with the sampled log).
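
As a rough illustration of the footnote, the following Python sketch estimates a loss rate from sequence-number gaps. It assumes that sampling happens after the loss, so a loss fraction p stretches the average gap from 1000 to 1000 / (1 - p); the function names and the input format are hypothetical, and the actual repair scripts may differ.

```python
def loss_rate(sequence_numbers, expected_gap=1000):
    """Estimate the fraction of lost messages for one server and one hour.

    sequence_numbers: sequence numbers seen in the sampled log for that
    server, in arrival order (hypothetical input format).
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sequence numbers")
    gaps = [b - a for a, b in zip(sequence_numbers, sequence_numbers[1:])]
    average_gap = sum(gaps) / len(gaps)
    if average_gap <= 0:
        raise ValueError("sequence numbers must increase overall")
    # Assumption: no loss gives an average gap of expected_gap, and a loss
    # fraction p stretches it to expected_gap / (1 - p).
    return max(0.0, 1 - expected_gap / average_gap)

def corrected_count(observed, rate):
    """Scale an observed count up to compensate for the estimated loss."""
    return observed / (1 - rate)
```

For example, an average gap of 1500 instead of 1000 would correspond to an estimated loss of one third of the messages, and observed counts would be scaled up by a factor of 1.5.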

 

From: [email protected] 
[mailto:[email protected]] On Behalf Of Joseph Allemandou
Sent: Monday, September 14, 2015 15:05
To: George Gkotsis; A mailing list for the Analytics Team at WMF and everybody 
who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] corrupted and missing log files

 

Hi George,

I don't really know about historical numbers :(

I'm forwarding your message to the Analytics mailing list to get some more help :)

Cheers

Joseph

 

---------- Forwarded message ----------
From: George Gkotsis <[email protected]>
Date: Mon, Sep 14, 2015 at 2:36 PM
Subject: corrupted and missing log files
To: [email protected], [email protected], [email protected], 
[email protected]



Greetings Wikimedia Analytics team!

 

First, thanks for your amazing work! Your work has an amazing impact on 
everyone, including researchers like me.

 

My name is George Gkotsis and I am a post-doctoral research fellow at King's 
College London. I have recently finished downloading the massive weblog files 
dataset and I am trying to "tame" the beast. As part of this process, I am 
reading all .gz files that concern Wikimedia page visits (downloaded from 
http://dumps.wikimedia.org/other/pagecounts-raw/*).

 

Unless I am mistaken, I have found cases of either missing or corrupt archives. 
Below are a few examples that I sampled at random:

 

Missing:

http://dumps.wikimedia.org/other/pagecounts-raw/2010/2010-07/pagecounts-20100705-09**

http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081021-23**

http://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/pagecounts-20090925-23**

 

Corrupted:

pagecounts-20080304-030000.gz

pagecounts-20080304-140000.gz

pagecounts-20080304-150000.gz

pagecounts-20090921-160000.gz

(the list is quite long and I haven't finished processing it, but I can give 
you a full log file)

 

Could you provide some feedback concerning the above cases?

 

Best regards,

George

 

-- 

/g





 

-- 

Joseph Allemandou

Data Engineer @ Wikimedia Foundation

IRC: joal

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
