[ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082878#comment-14082878 ]
Matei Zaharia commented on SPARK-2532:
--------------------------------------

I'm going to create a few sub-tasks for the major improvements here, to make it easier to put some of them in 1.1 and leave others for later.

> Fix issues with consolidated shuffle
> ------------------------------------
>
>                 Key: SPARK-2532
>                 URL: https://issues.apache.org/jira/browse/SPARK-2532
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>         Environment: All
>            Reporter: Mridul Muralidharan
>            Assignee: Mridul Muralidharan
>            Priority: Critical
>             Fix For: 1.1.0
>
> Will file a PR with the changes as soon as the merge is done (an earlier
> merge became outdated in 2 weeks, unfortunately :) ).
> Consolidated shuffle is broken in multiple ways in Spark:
> a) Task failure(s) can cause the state to become inconsistent.
> b) Multiple reverts, or a combination of close/revert/close, can cause the
> state to become inconsistent (as part of exception/error handling).
> c) Some of the API in the block writer causes implementation issues - for
> example, a revert is always followed by a close, but the implementation
> tries to keep them separate, creating surface area for errors.
> d) Fetching data from consolidated shuffle files can go badly wrong if the
> file is being actively written to: a segment's length is computed by
> subtracting its offset from the next offset (or from the file length, for
> the last segment) - the latter fails when a fetch happens in parallel with
> a write.
> Note: this happens even if there are no task failures of any kind!
> This usually results in stream corruption or decompression errors.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
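The offset arithmetic in (d) can be sketched as follows. This is a hypothetical Python illustration, not Spark's actual Scala code; the function and variable names are invented for clarity. It shows why deriving the last segment's length from the current file length is unsafe while a writer is still appending:

```python
# Hypothetical sketch of the length computation described in (d): each map
# task's output is one segment of a consolidated file, and a segment's
# length is derived from consecutive start offsets.

def segment_length(offsets, file_length, i):
    """Length of segment i, given the start offsets of all segments."""
    if i + 1 < len(offsets):
        return offsets[i + 1] - offsets[i]
    # Last segment: length is inferred from the observed file length.
    return file_length - offsets[i]

# Quiescent file: segments start at 0, 100, 250; file length is 400.
offsets = [0, 100, 250]
assert segment_length(offsets, 400, 1) == 150  # 250 - 100
assert segment_length(offsets, 400, 2) == 150  # 400 - 250

# Concurrent write: the writer appends while a fetch is in flight, so the
# observed file length has grown from 400 to 460. The fetcher now computes
# 210 bytes for the last segment and reads partially written data, which
# manifests as stream corruption or decompression errors downstream.
assert segment_length(offsets, 460, 2) == 210
```

Interior segments are safe because both offsets are fixed once written; only the length-based case for the trailing segment races with an active writer.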