Since the partition split has a good chance of containing data from almost all upstream mapper tasks, the cost of re-computing all upstream tasks may differ little from re-computing only the affected mapper tasks in most cases. Of course, this is not always true.
Changing a shuffle from 'complete' to 'incomplete' would also require refactoring the Worker's logic, which currently assumes that the set of succeeded attempts will not change after the files are finally committed.
... a subset of succeeded attempts. In Erik's proposal, the whole upstream stage will be rerun when data is lost.
Thank you for your response --- things are now much clearer.
From your comments shown above, let me assume that:
1. The whole upstream stage is rerun in the case of read failure.
2. Currently it is not easy to change the state of a shuffleId from
'complete' to 'incomplete'.
Then, for Celeborn-MR3, I would like to experiment with the following approach (a rough sketch of the idea in code follows the list):
1. If read failures occur for shuffleId #1, we pick a new shuffleId #2.
2. The upstream stage (or Vertex in the case of MR3) re-executes all of its tasks, but writes the output to shuffleId #2.
3. Tasks in the downstream stage retry by reading from shuffleId #2.
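
To make steps 1 and 2 concrete, the driver-side bookkeeping I have in mind is roughly the following. Everything here is a hypothetical MR3-side sketch (CelebornShuffleIdManager, edgeToShuffleId, and so on are made-up names), not an existing Celeborn or MR3 API:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical MR3-side helper that maps each shuffle edge of the DAG
    // to the Celeborn shuffleId currently in use for that edge.
    public class CelebornShuffleIdManager {
      private final Map<String, Integer> edgeToShuffleId = new ConcurrentHashMap<>();
      private final AtomicInteger nextShuffleId = new AtomicInteger(0);

      // Called when an edge is first set up: allocate shuffleId #1.
      public int registerEdge(String edgeId) {
        return edgeToShuffleId.computeIfAbsent(edgeId,
            k -> nextShuffleId.incrementAndGet());
      }

      // Current shuffleId that mappers write to and reducers read from.
      public int currentShuffleId(String edgeId) {
        return edgeToShuffleId.get(edgeId);
      }

      // Called when the downstream stage reports lost shuffle data:
      // switch the edge to a fresh shuffleId (#2), so that the re-executed
      // upstream Vertex writes to #2 and retried downstream tasks read from #2.
      public synchronized int switchToNewShuffleId(String edgeId) {
        int newShuffleId = nextShuffleId.incrementAndGet();
        edgeToShuffleId.put(edgeId, newShuffleId);
        return newShuffleId;
      }
    }

On a reported read failure, the DAGAppMaster would call switchToNewShuffleId(), re-run all tasks of the upstream Vertex against the new shuffleId, and only then reschedule the failed downstream attempts.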
Do you think this approach makes sense under the current architecture of Celeborn? If this approach is feasible, MR3 only needs to be notified by Celeborn ShuffleClient of read failures caused by lost data. Alternatively, we could just interpret IOException from Celeborn ShuffleClient as a read failure, in which case we can implement stage recompute without requiring any extension of Celeborn. (However, it would be great if Celeborn ShuffleClient could announce lost data explicitly.)
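
For the IOException interpretation, the reader-side change could be as small as the wrapper below. I am assuming ShuffleClient.readPartition(shuffleId, partitionId, attemptNumber, startMapIndex, endMapIndex) as the read entry point, although the exact signature varies across Celeborn versions, and ShuffleDataLostException is a hypothetical MR3-side exception:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.celeborn.client.ShuffleClient;

    // Hypothetical MR3-side exception used to signal that stage recompute
    // is needed.
    class ShuffleDataLostException extends Exception {
      ShuffleDataLostException(String message, Throwable cause) {
        super(message, cause);
      }
    }

    class CelebornPartitionReader {
      // Open a partition via Celeborn; treat any IOException from the read
      // path as lost shuffle data until Celeborn can announce it explicitly.
      // NOTE: the readPartition() arguments are an assumption; adjust to the
      // signature of the Celeborn version actually in use.
      static InputStream openPartition(ShuffleClient shuffleClient, int shuffleId,
                                       int partitionId, int attemptNumber,
                                       int startMapIndex, int endMapIndex)
          throws ShuffleDataLostException {
        try {
          return shuffleClient.readPartition(
              shuffleId, partitionId, attemptNumber, startMapIndex, endMapIndex);
        } catch (IOException e) {
          // The DAGAppMaster catches this and triggers stage recompute with
          // a new shuffleId, as described above.
          throw new ShuffleDataLostException(
              "Lost shuffle data: shuffleId=" + shuffleId
                  + ", partitionId=" + partitionId, e);
        }
      }
    }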
An industrial user of Hive-MR3-Celeborn is trying hard to reduce disk usage on Celeborn workers (which use SSDs of limited capacity), so stage recompute would be a great new feature for them.
Thank you,
--- Sungwoo