[
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen reassigned SPARK-7660:
---------------------------------
Assignee: Josh Rosen
> Snappy-java buffer-sharing bug leads to data corruption / test failures
> -----------------------------------------------------------------------
>
> Key: SPARK-7660
> URL: https://issues.apache.org/jira/browse/SPARK-7660
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 1.4.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
>
> snappy-java contains a bug that can cause separate SnappyOutputStream
> instances to end up sharing the same input and output buffers, which can
> result in data corruption. See
> https://github.com/xerial/snappy-java/issues/107 for my upstream bug report
> and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this
> issue.
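> To illustrate why the sharing is dangerous, here is a minimal, hypothetical
> sketch in plain Java (not snappy-java's actual code; ToyBufferedStream and
> its fields are invented for illustration): two buffered streams that share
> one scratch buffer clobber each other's unflushed bytes, so one writer's
> data ends up in both outputs, mirroring the test failure described below.
> {code:java}
> import java.io.ByteArrayOutputStream;
> import java.nio.charset.StandardCharsets;
>
> // Hypothetical illustration (not snappy-java itself) of why two output
> // streams must never share a single compression buffer: pending, unflushed
> // bytes from one stream get overwritten by the other.
> public class SharedBufferCorruption {
>
>   // A toy buffered stream; actual compression is omitted for clarity.
>   static class ToyBufferedStream {
>     final byte[] buffer;      // the (shared!) scratch buffer
>     int pos = 0;              // bytes buffered but not yet flushed
>     final ByteArrayOutputStream sink = new ByteArrayOutputStream();
>
>     ToyBufferedStream(byte[] buffer) { this.buffer = buffer; }
>
>     void write(byte[] data) {
>       System.arraycopy(data, 0, buffer, pos, data.length);
>       pos += data.length;
>     }
>
>     void flush() {
>       sink.write(buffer, 0, pos);
>       pos = 0;
>     }
>   }
>
>   public static void main(String[] args) {
>     byte[] shared = new byte[64];   // one buffer accidentally shared by two "tasks"
>     ToyBufferedStream taskA = new ToyBufferedStream(shared);
>     ToyBufferedStream taskB = new ToyBufferedStream(shared);
>
>     taskA.write("partition-A".getBytes(StandardCharsets.UTF_8));
>     taskB.write("partition-B".getBytes(StandardCharsets.UTF_8)); // clobbers A's bytes
>     taskA.flush();
>     taskB.flush();
>
>     // Both "shuffle files" now contain partition-B's data.
>     System.out.println(new String(taskA.sink.toByteArray(), StandardCharsets.UTF_8));
>     System.out.println(new String(taskB.sink.toByteArray(), StandardCharsets.UTF_8));
>   }
> }
> {code}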
> I discovered this issue because the buffer sharing was causing a test
> failure in JavaAPISuite: one of the repartition-and-sort tests returned the
> wrong answer because both tasks wrote their output through the same
> compression buffers and one task won the race, so its output was written to
> both shuffle output files. As a result, the test returned the result of
> collecting one partition twice (see
> https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more
> details).
> The buffer sharing can only occur if {{close()}} is called twice on the same
> SnappyOutputStream _and_ the JVM is experiencing little GC / memory pressure
> (for a more precise description of when this issue may occur, see my
> upstream tickets). I think that this double-close happens somewhere in test
> code that was added as part of my Tungsten shuffle patch, which exposed this
> bug (to see this, download a recent build of master and run
> https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to
> force the test execution order that triggers the bug).
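> For intuition about how a double {{close()}} can leave two streams sharing a
> buffer, here is a minimal, hypothetical model of the recycling pattern (the
> pool, acquire/release helpers, and ToyOutputStream below are invented for
> illustration and are not snappy-java's actual internals): closing a stream
> returns its buffer to a pool, so a second close() donates the same array
> twice and the next two streams both receive it.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> // Hypothetical, simplified model of a buffer-recycling double-close bug.
> public class DoubleCloseBufferSharing {
>
>   // A toy buffer pool standing in for an internal buffer allocator.
>   static final Deque<byte[]> POOL = new ArrayDeque<>();
>
>   static byte[] acquire(int size) {
>     byte[] b = POOL.poll();
>     return (b != null) ? b : new byte[size];
>   }
>
>   static void release(byte[] b) {
>     POOL.push(b);   // no guard against releasing the same buffer twice
>   }
>
>   // A toy output stream that returns its buffer to the pool on close().
>   static class ToyOutputStream {
>     byte[] buffer = acquire(32 * 1024);
>
>     void close() {
>       release(buffer);
>       // The fix is to make close() idempotent, e.g. null out the field and
>       // skip the release on subsequent calls.
>     }
>   }
>
>   public static void main(String[] args) {
>     ToyOutputStream closedTwice = new ToyOutputStream();
>     closedTwice.close();
>     closedTwice.close();   // double close: the same array is now pooled twice
>
>     ToyOutputStream streamA = new ToyOutputStream();  // pops the array once
>     ToyOutputStream streamB = new ToyOutputStream();  // pops the *same* array again
>
>     System.out.println("streams share a buffer: " + (streamA.buffer == streamB.buffer));
>   }
> }
> {code}
> (The sketch ignores the GC / memory-pressure condition mentioned above; it
> only shows the double-release mechanics.)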
> I think it's rare for this bug to produce silent wrong answers like this
> one. In more realistic workloads that write more than a handful of bytes per
> task, I would expect this issue to surface as stream corruption errors like
> SPARK-4105.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]