Hi all,

Now that we’ve had a little space to analyze the problem, I wanted to call out 
a recent webrequest data loss issue that we experienced on two separate 
occasions.

We attempted to upgrade to Kafka 0.8.2.1, and it wasn’t until the second 
attempt that we actually found the problem.  Kafka 0.8.2.1 ships with a buggy 
version of Snappy[1] that causes messages to not be compressed properly.  This 
caused a ~4x increase in network and disk I/O across the cluster all at once.

We’ve documented the incidents and the occasions of significant data loss here:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150803-Kafka 

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150810-Kafka#Conclusions

https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest 

This loss will affect the output of the pagecount* and pageview datasets, as 
well as other webrequest-generated statistics.  Please consider any statistics 
generated from webrequest data in the following UTC hours unreliable:

  2015-08-03T18:00 - 2015-08-03T23:00
  2015-08-10T15:00 - 2015-08-10T21:00
  2015-08-11T17:00 - 2015-08-11T18:00
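
For anyone who wants to flag or drop these hours in their own jobs, here is a 
minimal Python sketch.  The names (UNRELIABLE_RANGES, is_unreliable) are made 
up for illustration, and it assumes that both endpoints of each range above 
are affected hourly buckets, so the ending hour is treated as unreliable too.  
Adjust if you read the ranges differently.

```python
from datetime import datetime

# Unreliable UTC hour ranges, copied from the list above.
# Assumption: both endpoints name affected hourly buckets (inclusive).
UNRELIABLE_RANGES = [
    (datetime(2015, 8, 3, 18), datetime(2015, 8, 3, 23)),
    (datetime(2015, 8, 10, 15), datetime(2015, 8, 10, 21)),
    (datetime(2015, 8, 11, 17), datetime(2015, 8, 11, 18)),
]


def is_unreliable(ts):
    """Return True if a naive UTC timestamp falls in a flagged hour."""
    # Truncate to the hourly bucket the timestamp belongs to.
    hour = ts.replace(minute=0, second=0, microsecond=0)
    return any(start <= hour <= end for start, end in UNRELIABLE_RANGES)
```

A job could call is_unreliable() on each hourly partition's timestamp and 
either skip the partition or annotate its output.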

Many apologies for any inconvenience this causes.  We’ve learned a lot during 
this turmoil, and have a lot of ideas on how to hopefully prevent this from 
happening again, and on how to reduce loss and complexity if and when it does.  
The analytics engineering team will be doing a postmortem on this soon, in 
which we will document these ideas.

Thanks,
-Andrew Otto

[1] https://issues.apache.org/jira/browse/KAFKA-2189 
