[ 
https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557624#comment-14557624
 ] 

Josh Rosen commented on SPARK-7829:
-----------------------------------

Thanks for splitting this off as a sub-issue from SPARK-7308.  This issue might 
be one of the last remaining pieces for explaining some of the shuffle 
corruption issues that we've seen in sort-based shuffle.  A bug here would 
actually be consistent with some of the non-determinism of that issue, since it 
sounds like this issue is only triggered in certain stage retry cases when 
using certain shuffle paths.

As I commented over at SPARK-7308, the best way to address this might be with a 
sort of commit protocol in the ShuffleMapTask code. Some of the fixes that 
you've included for this as part of your other patch seem okay, but I think 
that they're a little messy compared to avoiding the appends in the first 
place.  I was wondering whether we could just delete the old file rather than 
appending to it, but that might mess things up if another concurrent downstream 
stage is attempting to fetch from those map output partitions while we're 
recomputing them.
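
To make the commit-protocol idea concrete, here is a minimal sketch in Python rather than the actual ShuffleMapTask code. Everything in it is an assumption for illustration: the helper names, the file naming, and the index layout (big-endian 8-byte offsets, one more entry than there are partitions). Each attempt writes to fresh temporary files and only then renames them over the final paths, so a retry never appends to an earlier attempt's output.

```python
import os
import struct
import tempfile


def write_map_output(out_dir, map_id, partitions):
    """Write one attempt's output to temp files, then publish it by rename.

    Hypothetical helper, not Spark's API. `partitions` is a list of bytes
    objects, one per reduce partition.
    """
    data_path = os.path.join(out_dir, f"shuffle_{map_id}.data")
    index_path = os.path.join(out_dir, f"shuffle_{map_id}.index")

    # Temp files live in the same directory so the rename stays on one
    # filesystem.
    data_fd, data_tmp = tempfile.mkstemp(dir=out_dir, suffix=".data.tmp")
    index_fd, index_tmp = tempfile.mkstemp(dir=out_dir, suffix=".index.tmp")

    offset = 0
    with os.fdopen(data_fd, "wb") as data, os.fdopen(index_fd, "wb") as index:
        index.write(struct.pack(">q", offset))
        for part in partitions:
            data.write(part)
            offset += len(part)
            index.write(struct.pack(">q", offset))

    # Commit step: replace any earlier attempt's files instead of appending.
    os.replace(data_tmp, data_path)
    os.replace(index_tmp, index_path)
    return data_path, index_path


def read_partition(data_path, index_path, reduce_id):
    """Read one partition using the start/end offsets stored in the index."""
    with open(index_path, "rb") as index:
        index.seek(reduce_id * 8)
        start, end = struct.unpack(">qq", index.read(16))
    with open(data_path, "rb") as data:
        data.seek(start)
        return data.read(end - start)
```

Note that the two renames are not atomic as a pair, so a fetch that lands between them could still see a new data file with an old index. That is the same concurrency concern as with deleting the old file, and a real fix would need to handle it.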

> SortShuffleWriter writes inconsistent data & index files on stage retry
> -----------------------------------------------------------------------
>
>                 Key: SPARK-7829
>                 URL: https://issues.apache.org/jira/browse/SPARK-7829
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.3.1
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>
> When a stage is retried, a shuffle map task may be rerun even if it already 
> succeeded.  If it happens to get scheduled on the same executor, the old 
> data file is *appended to*, while the index file still assumes the data 
> starts at position 0.  This leads to an apparently corrupt shuffle map 
> output, since when the data file is read, the index file points to the 
> wrong location.
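
The failure mode described in the issue can be reproduced in miniature. The sketch below is not Spark code: it uses an in-memory buffer as a stand-in for the data file and a plain list of offsets as a stand-in for the index file, and the function names are made up for illustration. The retry appends its bytes after the first attempt's bytes but records offsets as if it had started at position 0.

```python
import io


def append_attempt(data, partitions):
    """Append one attempt's partitions; return offsets that assume a start of 0.

    Hypothetical helper that mimics the buggy behaviour: the write goes to
    the end of the existing data, but the offsets ignore what is already there.
    """
    data.seek(0, io.SEEK_END)
    offsets = [0]
    for part in partitions:
        data.write(part)
        offsets.append(offsets[-1] + len(part))
    return offsets


def read_partition(data, offsets, reduce_id):
    """Read a partition the way a fetcher would, trusting the index offsets."""
    data.seek(offsets[reduce_id])
    return data.read(offsets[reduce_id + 1] - offsets[reduce_id])


data = io.BytesIO()
first_index = append_attempt(data, [b"AA", b"BBB"])   # original attempt
retry_index = append_attempt(data, [b"cccc", b"d"])   # retry on same executor

# The retry meant partition 0 to be b"cccc", but its index points at the
# start of the buffer, where the first attempt's bytes still sit.
stale = read_partition(data, retry_index, 0)
```

Here the buffer ends up holding both attempts back to back, and reading partition 0 through the retry's index returns bytes from the first attempt instead of `b"cccc"`, which is what looks like corruption to the reader.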



