Hi Mridul,

I still have a question. DAGScheduler#submitMissingTasks will only call
unregisterAllMapAndMergeOutput if the current ShuffleMapStage is INDETERMINATE.
What if the current stage is DETERMINATE, but its upstream stage is
INDETERMINATE and that upstream stage is rerun?
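
For context, here is a rough paraphrase of the check I'm referring to (a simplified
Scala sketch, not the verbatim Spark source; the parameters are just for illustration):

def onResubmitShuffleMapStage(
    stageIsIndeterminate: Boolean,
    shuffleId: Int,
    unregisterAllMapAndMergeOutput: Int => Unit): Unit = {
  if (stageIsIndeterminate) {
    // all map (and merge) output is dropped, so every task of this stage reruns
    unregisterAllMapAndMergeOutput(shuffleId)
  }
  // otherwise only the missing map partitions are recomputed, even if an
  // upstream INDETERMINATE stage was rerun - which is the case I am asking about
}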

Thanks,
Keyong Zhou

Mridul Muralidharan <mri...@gmail.com> wrote on Fri, Oct 20, 2023 at 11:15:

> To add to my response - what I described (w.r.t. failing the job) applies only to
> ResultStage.
> It walks the lineage DAG to identify all indeterminate parents to roll back.
> If there are only ShuffleMapStages in the set of stages to roll back, it
> will simply discard their output, roll back all of them, and then retry
> these stages (same shuffle id, a new stage attempt).
>
>
> Regards,
> Mridul
>
>
>
> On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan <mri...@gmail.com>
> wrote:
>
> >
> > Good question, and ResultStage is actually special-cased in Spark, as its
> > output could have already been consumed (for example collect() to the driver,
> > etc.) - and so if it is one of the stages which needs to be rolled back, the
> > job is aborted.
> >
> > To illustrate, see the following:
> > -- snip --
> >
> > package org.apache.spark
> >
> >
> > import scala.reflect.ClassTag
> >
> > import org.apache.spark._
> > import org.apache.spark.rdd.{DeterministicLevel, RDD}
> >
> > class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
> >
> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >     delegate.compute(split, context)
> >   }
> >
> >   override protected def getPartitions: Array[Partition] =
> >     delegate.partitions
> > }
> >
> > // Forces the delegate's output to be treated as INDETERMINATE.
> > class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >   override def getOutputDeterministicLevel: DeterministicLevel.Value =
> >     DeterministicLevel.INDETERMINATE
> > }
> >
> > // Kills the executor JVM on the first attempt of partition 0, so the shuffle data
> > // hosted by that executor is lost and the next attempt hits a FetchFailure.
> > class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >     val tc = TaskContext.get
> >     if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
> >       // Wait for all tasks to be done, then call exit
> >       Thread.sleep(5000)
> >       System.exit(-1)
> >     }
> >     delegate.compute(split, context)
> >   }
> > }
> >
> > // Make sure the test_output directory is deleted before running this.
> > object Test {
> >
> >   def main(args: Array[String]): Unit = {
> >     // An application name is required by SparkContext.
> >     val conf = new SparkConf()
> >       .setAppName("indeterminate-rollback-test")
> >       .setMaster("local-cluster[4,1,1024]")
> >     val sc = new SparkContext(conf)
> >
> >     val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 10000, 20).map(v => (v, v)))
> >     val resultRdd = new FailingRDD(mapperRdd.groupByKey())
> >     resultRdd.saveAsTextFile("test_output")
> >   }
> > }
> >
> > -- snip --
> >
> >
> >
> > Here, the mapper stage has been forced to be INDETERMINATE.
> > In the reducer stage, the first attempt to compute partition 0 will wait
> > for a bit and then exit - since the master is a local-cluster, this results
> > in a FetchFailure when the second attempt of partition 0 tries to fetch
> > shuffle data.
> > When Spark tries to regenerate the parent shuffle output, it sees that the
> > parent is INDETERMINATE - and so fails the entire job with the message:
> > "
> > org.apache.spark.SparkException: Job aborted due to stage failure: A
> > shuffle map stage with indeterminate output was failed and retried.
> > However, Spark cannot rollback the ResultStage 1 to re-process the input
> > data, and has to fail this job. Please eliminate the indeterminacy by
> > checkpointing the RDD before repartition and try again.
> > "
> >
> > This is coming from here <
> > https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
> > - when rolling back stages, if Spark determines that a ResultStage needs to
> > be rolled back due to loss of INDETERMINATE output, it will fail the job.
> >
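> > Roughly, the decision at that point looks like this (a simplified paraphrase of
> > that code path, not the verbatim Spark source, with made-up parameters):
> >
> > def rollbackAction(isResultStage: Boolean, someOutputAlreadyProduced: Boolean): String = {
> >   if (isResultStage && someOutputAlreadyProduced) {
> >     // result partitions may already have been consumed (collect(), committed output, ...),
> >     // so the input cannot be silently re-processed: the job is aborted
> >     "abort job"
> >   } else {
> >     // ShuffleMapStages are simply discarded and rerun with a new stage attempt,
> >     // keeping the same shuffle id
> >     "discard output and rerun stage"
> >   }
> > }
> >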
> > Hope this clarifies.
> > Regards,
> > Mridul
> >
> >
> > On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou <zho...@apache.org> wrote:
> >
> >> In fact, I'm wondering whether Spark will rerun the whole reduce
> >> ShuffleMapStage
> >> if its upstream ShuffleMapStage is INDETERMINATE and is rerun.
> >>
> >> Keyong Zhou <zho...@apache.org> wrote on Thu, Oct 19, 2023 at 23:00:
> >>
> >> > Thanks Erik for bringing up this question. I'm also curious about the
> >> > answer; any feedback is appreciated.
> >> >
> >> > Thanks,
> >> > Keyong Zhou
> >> >
> >> > Erik fang <fme...@gmail.com> wrote on Thu, Oct 19, 2023 at 22:16:
> >> >
> >> >> Mridul,
> >> >>
> >> >> sure, I totally agree SPARK-25299 is a much better solution, as long as we
> >> >> can get it from the Spark community
> >> >> (btw, private[spark] on RDD.outputDeterministicLevel is no big deal,
> >> >> Celeborn already has Spark-integration code with [spark] scope)
> >> >>
> >> >> I also have a question about INDETERMINATE stage recompute, and may need
> >> >> your help.
> >> >> The rule for INDETERMINATE ShuffleMapStage rerun is reasonable; however, I
> >> >> don't find related logic for INDETERMINATE ResultStage rerun in
> >> >> DAGScheduler.
> >> >> If an INDETERMINATE ShuffleMapStage gets entirely recomputed, the
> >> >> corresponding ResultStage should be entirely recomputed as well, per my
> >> >> understanding.
> >> >>
> >> >> I found https://issues.apache.org/jira/browse/SPARK-25342 to roll back a
> >> >> ResultStage, but it was not merged.
> >> >> Do you know of any context or a related ticket for INDETERMINATE ResultStage
> >> >> rerun?
> >> >>
> >> >> Thanks in advance!
> >> >>
> >> >> Regards,
> >> >> Erik
> >> >>
> >> >> On Tue, Oct 17, 2023 at 4:23 AM Mridul Muralidharan <mri...@gmail.com> wrote:
> >> >>
> >> >> >
> >> >> >
> >> >> > On Mon, Oct 16, 2023 at 11:31 AM Erik fang <fme...@gmail.com> wrote:
> >> >> >
> >> >> >> Hi Mridul,
> >> >> >>
> >> >> >> For a),
> >> >> >> DAGScheduler uses Stage.isIndeterminate() and RDD.isBarrier()
> >> >> >> <https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1975>
> >> >> >> to decide whether the whole stage needs to be recomputed.
> >> >> >> I think we can pass the same information to Celeborn in
> >> >> >> ShuffleManager.registerShuffle()
> >> >> >> <https://github.com/apache/spark/blob/721ea9bbb2ff77b6d2f575fdca0aeda84990cc3b/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala#L39>,
> >> >> >> since the ShuffleDependency contains the RDD object.
> >> >> >> It seems Stage.isIndeterminate() is not accessible from ShuffleDependency,
> >> >> >> but luckily rdd is used internally:
> >> >> >>
> >> >> >> def isIndeterminate: Boolean = {
> >> >> >>   rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
> >> >> >> }
> >> >> >>
> >> >> >> Relying on internal implementation is not good, but doable.
> >> >> >> I don't expect the Spark RDD/Stage implementation to change frequently,
> >> >> >> and we can discuss an RDD isIndeterminate API with the Spark community if
> >> >> >> they change it in the future.
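> >> >> >>
> >> >> >> As a rough sketch (a hypothetical helper, not existing Celeborn code; it assumes
> >> >> >> [spark]-scoped integration code as mentioned above so the private[spark] member
> >> >> >> is visible):
> >> >> >>
> >> >> >> package org.apache.spark.celeborn.sketch // hypothetical package under org.apache.spark
> >> >> >>
> >> >> >> import org.apache.spark.ShuffleDependency
> >> >> >> import org.apache.spark.rdd.DeterministicLevel
> >> >> >>
> >> >> >> object ShuffleDeterminism {
> >> >> >>   // True if the map-side RDD of this shuffle is INDETERMINATE, i.e. a rerun of the
> >> >> >>   // map stage can produce different output, so previously committed shuffle data
> >> >> >>   // for this shuffle id should be dropped rather than reused.
> >> >> >>   def isIndeterminate(dep: ShuffleDependency[_, _, _]): Boolean =
> >> >> >>     dep.rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
> >> >> >> }
> >> >> >>
> >> >> >> registerShuffle() could call this on the ShuffleDependency it receives and record
> >> >> >> the flag per shuffle id in the LifecycleManager.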
> >> >> >>
> >> >> >
> >> >> >
> >> >> > Only RDD.getOutputDeterministicLevel is publicly exposed;
> >> >> > RDD.outputDeterministicLevel is not, and it is private[spark].
> >> >> > While I don't expect changes to this, it is inherently unstable to
> >> >> > depend on it.
> >> >> >
> >> >> > Btw, please see the discussion with Sungwoo Park: if Celeborn is
> >> >> > maintaining a reducer-oriented view, you will need to recompute all
> >> >> > the mappers anyway - what you might save is the subset of reducer
> >> >> > partitions which can be skipped if the stage is DETERMINATE.
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> for c)
> >> >> >> I also considered a similar solution in Celeborn.
> >> >> >> Celeborn (LifecycleManager) can get the full picture of the remaining shuffle
> >> >> >> data from the previous stage attempt and reuse it in stage recompute,
> >> >> >> and the whole process will be transparent to Spark/DAGScheduler.
> >> >> >>
> >> >> >
> >> >> > Celeborn does not have visibility into this - and this is potentially
> >> >> > subject to invasive changes in Apache Spark as it evolves.
> >> >> > For example, I recently merged a couple of changes which would make this
> >> >> > different in master compared to previous versions.
> >> >> > Until the remote shuffle service SPIP is implemented and these are
> >> >> > abstracted out & made pluggable, it will continue to be quite volatile.
> >> >> >
> >> >> > Note that the behavior for 3.5 and older is known - since Spark versions
> >> >> > have been released - it is the behavior in master and future versions of
> >> >> > Spark which is subject to change.
> >> >> > So delivering on SPARK-25299 would future-proof all remote shuffle
> >> >> > implementations.
> >> >> >
> >> >> >
> >> >> > Regards,
> >> >> > Mridul
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> From my perspective, leveraging partial stage recompute and the
> >> >> >> remaining shuffle data needs a lot of work in Celeborn.
> >> >> >> I prefer to implement a simple whole-stage recompute first, with an
> >> >> >> interface defined with a recomputeAll = true flag, and explore partial
> >> >> >> stage recompute in a separate ticket as a future optimization.
> >> >> >> What do you think about it?
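> >> >> >>
> >> >> >> As a strawman, the interface could look roughly like this (a hypothetical
> >> >> >> sketch, not existing Celeborn code; the names are made up):
> >> >> >>
> >> >> >> // recomputeAll = true means all committed shuffle data for this shuffle id is
> >> >> >> // discarded and every mapper of the stage is expected to rerun.
> >> >> >> trait ShuffleRecomputeHandler {
> >> >> >>   def onStageRecompute(shuffleId: Int, stageAttempt: Int, recomputeAll: Boolean): Unit
> >> >> >> }
> >> >> >>
> >> >> >> // Trivial first implementation: ignore partial recompute and always drop everything.
> >> >> >> class WholeStageRecompute(dropShuffle: Int => Unit) extends ShuffleRecomputeHandler {
> >> >> >>   override def onStageRecompute(shuffleId: Int, stageAttempt: Int, recomputeAll: Boolean): Unit =
> >> >> >>     dropShuffle(shuffleId)
> >> >> >> }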
> >> >> >>
> >> >> >> Regards,
> >> >> >> Erik
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Oct 14, 2023 at 4:50 PM Mridul Muralidharan <mri...@gmail.com> wrote:
> >> >> >>
> >> >> >>>
> >> >> >>>
> >> >> >>> On Sat, Oct 14, 2023 at 3:49 AM Mridul Muralidharan <mri...@gmail.com> wrote:
> >> >> >>>
> >> >> >>>>
> >> >> >>>> A reducer-oriented view of shuffle, especially without replication,
> >> >> >>>> could indeed be susceptible to this issue you described (a single
> >> >> >>>> fetch failure would require all mappers to be recomputed) - note,
> >> >> >>>> not necessarily all reducers to be recomputed though.
> >> >> >>>>
> >> >> >>>> Note that I have not looked much into Celeborn specifically on this
> >> >> >>>> aspect yet, so my comments are *fairly* centric to Spark internals :-)
> >> >> >>>>
> >> >> >>>> Regards,
> >> >> >>>> Mridul
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Sat, Oct 14, 2023 at 3:36 AM Sungwoo Park <glap...@gmail.com> wrote:
> >> >> >>>>
> >> >> >>>>> Hello,
> >> >> >>>>>
> >> >> >>>>> (Sorry for sending the same message again.)
> >> >> >>>>>
> >> >> >>>>> From my understanding, the current implementation of Celeborn makes it
> >> >> >>>>> hard to find out which mapper should be re-executed when a partition cannot
> >> >> >>>>> be read, and we should re-execute all the mappers in the upstream stage. If
> >> >> >>>>> we can find out which mapper/partition should be re-executed, the current
> >> >> >>>>> logic of stage recomputation could be (partially or totally) reused.
> >> >> >>>>>
> >> >> >>>>> Regards,
> >> >> >>>>>
> >> >> >>>>> --- Sungwoo
> >> >> >>>>>
> >> >> >>>>> On Sat, Oct 14, 2023 at 5:24 PM Mridul Muralidharan <mri...@gmail.com> wrote:
> >> >> >>>>>
> >> >> >>>>>>
> >> >> >>>>>> Hi,
> >> >> >>>>>>
> >> >> >>>>>>   Spark will try to minimize the recomputation cost as much as
> >> >> >>>>>> possible.
> >> >> >>>>>> For example, if the parent stage was DETERMINATE, it simply needs to
> >> >> >>>>>> recompute the missing (mapper) partitions (which resulted in the fetch
> >> >> >>>>>> failure). Note, this by itself could require further recomputation in the
> >> >> >>>>>> DAG if the inputs required to compute the parent partitions are missing, and
> >> >> >>>>>> so on - so it is dynamic.
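> >> >> >>>>>>
> >> >> >>>>>> In other words, for the parent map stage (a simplified sketch of the
> >> >> >>>>>> decision, not the verbatim Spark source):
> >> >> >>>>>>
> >> >> >>>>>> def mapPartitionsToRerun(
> >> >> >>>>>>     parentIsIndeterminate: Boolean,
> >> >> >>>>>>     numPartitions: Int,
> >> >> >>>>>>     missingPartitions: Seq[Int]): Seq[Int] = {
> >> >> >>>>>>   if (parentIsIndeterminate) 0 until numPartitions // discard everything, rerun all mappers
> >> >> >>>>>>   else missingPartitions // DETERMINATE: rerun only the lost map partitions
> >> >> >>>>>> }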
> >> >> >>>>>>
> >> >> >>>>>> Regards,
> >> >> >>>>>> Mridul
> >> >> >>>>>>
> >> >> >>>>>> On Sat, Oct 14, 2023 at 2:30 AM Sungwoo Park <o...@pl.postech.ac.kr> wrote:
> >> >> >>>>>>
> >> >> >>>>>>> > a) If one or more tasks for a stage (and so its shuffle id) is
> >> >> >>>>>>> > going to be recomputed, if it is an INDETERMINATE stage, all shuffle
> >> >> >>>>>>> > output will be discarded and it will be entirely recomputed (see here
> >> >> >>>>>>> > <https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1477>
> >> >> >>>>>>> > ).
> >> >> >>>>>>>
> >> >> >>>>>>> If a reducer (in a downstream stage) fails to read data, can we find out
> >> >> >>>>>>> which tasks should recompute their output? From the previous discussion, I
> >> >> >>>>>>> thought this was hard (in the current implementation), and we should
> >> >> >>>>>>> re-execute all tasks in the upstream stage.
> >> >> >>>>>>>
> >> >> >>>>>>> Thanks,
> >> >> >>>>>>>
> >> >> >>>>>>> --- Sungwoo
> >> >> >>>>>>>
> >> >> >>>>>>
> >> >>
> >> >
> >>
> >
>
