Hi,

  Spark will try to minimize the recomputation cost as much as possible.
For example, if the parent stage was DETERMINATE, it simply needs to
recompute the missing (mapper) partitions (those which resulted in the
fetch failure). Note, this by itself could require further recomputation
up the DAG if the inputs required to compute those parent partitions are
themselves missing, and so on - so it is dynamic.
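
To make this concrete, here is a minimal, self-contained sketch (Scala,
but deliberately not Spark's actual DAGScheduler code - the Stage and
Determinism types and the availablePartitions set below are simplified
stand-ins for illustration) of how the set of partitions to recompute is
decided and can grow dynamically up the DAG:

    sealed trait Determinism
    case object Determinate extends Determinism
    case object Indeterminate extends Determinism

    case class Stage(
        id: Int,
        determinism: Determinism,
        numPartitions: Int,
        parents: Seq[Stage],
        availablePartitions: Set[Int]) // map outputs still registered

    // Partitions of `stage` to rerun, given those lost to a fetch failure.
    def toRecompute(stage: Stage, lost: Set[Int]): Set[Int] =
      stage.determinism match {
        // DETERMINATE: rerunning a task reproduces the same output, so
        // only the missing map partitions need to run again.
        case Determinate => lost
        // INDETERMINATE: a rerun may produce different output, so all
        // shuffle output is discarded and the stage reruns entirely.
        case Indeterminate => (0 until stage.numPartitions).toSet
      }

    // The dynamic part: if inputs needed to recompute `stage` are also
    // missing upstream, the walk continues recursively up the DAG.
    def stagesToRun(stage: Stage, lost: Set[Int]): Map[Int, Set[Int]] = {
      val here = toRecompute(stage, lost)
      if (here.isEmpty) Map.empty
      else {
        val upstream = stage.parents.flatMap { p =>
          // A shuffle dependency is all-to-all, so recomputing anything
          // here needs every parent partition no longer available.
          val missing = (0 until p.numPartitions).toSet -- p.availablePartitions
          stagesToRun(p, missing)
        }.toMap
        upstream + (stage.id -> here)
      }
    }

    // Example: a DETERMINATE parent with all 4 outputs still available,
    // and a DETERMINATE child that lost the output of partition 2.
    val parent = Stage(0, Determinate, 4, Nil, Set(0, 1, 2, 3))
    val child  = Stage(1, Determinate, 4, Seq(parent), Set(0, 1, 3))
    stagesToRun(child, Set(2)) // Map(1 -> Set(2)): only one task reruns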

Regards,
Mridul

On Sat, Oct 14, 2023 at 2:30 AM Sungwoo Park <o...@pl.postech.ac.kr> wrote:

> > a) If one or more tasks for a stage (and so its shuffle id) is going to be
> > recomputed, if it is an INDETERMINATE stage, all shuffle output will be
> > discarded and it will be entirely recomputed (see here
> > <https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1477>).
>
> If a reducer (in a downstream stage) fails to read data, can we find out
> which tasks should recompute their output? From the previous discussion, I
> thought this was hard (in the current implementation), and we should
> re-execute all tasks in the upstream stage.
>
> Thanks,
>
> --- Sungwoo
>