[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544987#comment-14544987 ]
Josh Rosen commented on SPARK-7660:
-----------------------------------

Note that this affects more than just Spark 1.4.0; I'll trace back and figure out the complete list of affected versions tomorrow, but I think that any version that relied on a snappy-java release published after mid-June or July 2014 may be affected.

> Snappy-java buffer-sharing bug leads to data corruption / test failures
> -----------------------------------------------------------------------
>
>                 Key: SPARK-7660
>                 URL: https://issues.apache.org/jira/browse/SPARK-7660
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> snappy-java contains a bug that can lead to situations where separate
> SnappyOutputStream instances end up sharing the same input and output
> buffers, which can lead to data corruption issues. See
> https://github.com/xerial/snappy-java/issues/107 for my upstream bug report
> and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this
> issue.
> I discovered this issue because the buffer-sharing was leading to a test
> failure in JavaAPISuite: one of the repartition-and-sort tests was returning
> the wrong answer because both tasks wrote their output using the same
> compression buffers and one task won the race, causing its output to be
> written to both shuffle output files. As a result, the test returned the
> result of collecting one partition twice.
> The buffer-sharing can only occur if {{close()}} is called twice on the same
> SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for
> a more precise description of when this issue may occur, see my upstream
> tickets).
> I think that this double-close happens somewhere in some test code
> that was added as part of my Tungsten shuffle patch, exposing this bug (to
> see this, download a recent build of master and run
> https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to
> force the test execution order that triggers the bug).
> I think that it's rare that this bug would lead to silent failures like this.
> In more realistic workloads that aren't writing only a handful of bytes per
> task, I would expect this issue to lead to stream corruption issues like
> SPARK-4105.
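
To make the failure mode described in the issue concrete, here is a minimal, self-contained Java sketch of how a double close() can cause two later streams to share one buffer. This is a toy model and not snappy-java's actual code: BufferPool, PooledStream, the guarded flag, and the 16-byte buffer size are all hypothetical names and values invented for illustration. It also does not model the GC / memory pressure condition mentioned in the description; it only shows the double-release step.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DoubleCloseDemo {

    /** Hypothetical pool of reusable buffers (an assumption, not snappy-java's allocator). */
    static final class BufferPool {
        private final Deque<byte[]> free = new ArrayDeque<>();

        byte[] acquire() {
            byte[] cached = free.poll();          // null when the pool is empty
            return cached != null ? cached : new byte[16];
        }

        void release(byte[] buffer) {
            free.push(buffer);                    // no check for duplicate releases
        }
    }

    /** Hypothetical stream that borrows its buffer from the pool. */
    static final class PooledStream implements AutoCloseable {
        private final BufferPool pool;
        private final boolean guarded;            // true = close() is idempotent
        final byte[] buffer;
        private boolean closed = false;

        PooledStream(BufferPool pool, boolean guarded) {
            this.pool = pool;
            this.guarded = guarded;
            this.buffer = pool.acquire();
        }

        @Override
        public void close() {
            if (guarded && closed) {
                return;                           // second close() is a no-op
            }
            closed = true;
            pool.release(buffer);                 // unguarded: runs again on every close()
        }
    }

    /** Closes one stream twice, then reports whether the next two streams share a buffer. */
    static boolean sharesBufferAfterDoubleClose(boolean guarded) {
        BufferPool pool = new BufferPool();
        PooledStream first = new PooledStream(pool, guarded);
        first.close();
        first.close();                            // the double close from the bug report

        PooledStream a = new PooledStream(pool, guarded);
        PooledStream b = new PooledStream(pool, guarded);
        return a.buffer == b.buffer;              // same array instance means shared state
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + sharesBufferAfterDoubleClose(false)); // prints "unguarded: true"
        System.out.println("guarded: " + sharesBufferAfterDoubleClose(true));    // prints "guarded: false"
    }
}
```

In the unguarded case the same array is returned to the pool twice, so the next two streams both receive it and their writes can overwrite each other, which matches the symptom of one task's output appearing in both shuffle files. Making close() idempotent is one common way to prevent the double release; whether that matches the upstream patch should be checked against the pull request linked in the description.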