Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

orpl Wed, 27 Sep 2023 20:12:03 -0700

Hello,

As we are developing an MR3 extension for Celeborn, I would like to add my comments on stage re-run in the context of using Celeborn for MR3. I don't know the internal details of Spark stage re-run very well, so my apologies if my comments are irrelevant to the proposal in the design document.


For Celeborn-MR3, we only need the following two features:

1. When mapper output is lost and read errors occur, the CelebornIOException from ShuffleClientImpl includes the task index of the mapper (or a set of task indexes) whose output has been lost.

2. ShuffleClientImpl notifies LifeCycleManager so that ShuffleClient.mapperEnd(shuffleId, mapper_task_index, ...) can be called again. In other words, LifeCycleManager marks shuffleId from 'complete' back to 'incomplete'.

Then, the task re-execution mechanism of MR3 can take care of the rest, by re-executing the mapper and calling ShuffleClient.mapperEnd() again.
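To make the two features concrete, here is a minimal toy model in Python. It is only a sketch and not Celeborn code: LostMapOutputError, ToyLifecycleManager, report_lost, and is_complete are hypothetical names invented for illustration, and the real CelebornIOException and LifeCycleManager may look quite different.

```python
from typing import Set


class LostMapOutputError(IOError):
    """Hypothetical stand-in for an exception that names the lost mappers (feature 1)."""

    def __init__(self, shuffle_id: int, lost_mappers: Set[int]) -> None:
        super().__init__(
            f"shuffle {shuffle_id}: lost output of mappers {sorted(lost_mappers)}"
        )
        self.shuffle_id = shuffle_id
        self.lost_mappers = set(lost_mappers)


class ToyLifecycleManager:
    """Toy bookkeeping for a single shuffle; an assumption, not the real class."""

    def __init__(self, num_mappers: int) -> None:
        self.num_mappers = num_mappers
        self.ended: Set[int] = set()

    def mapper_end(self, mapper_index: int) -> None:
        # Called when a mapper finishes, including after a re-execution.
        self.ended.add(mapper_index)

    def is_complete(self) -> bool:
        return len(self.ended) == self.num_mappers

    def report_lost(self, err: LostMapOutputError) -> None:
        # Feature 2: forget the lost mappers, so the shuffle goes from
        # 'complete' back to 'incomplete' and mapper_end can be called again.
        self.ended -= err.lost_mappers


# Usage: all mappers finish, one output is lost, the mapper is re-executed.
manager = ToyLifecycleManager(num_mappers=3)
for index in range(3):
    manager.mapper_end(index)

error = LostMapOutputError(shuffle_id=0, lost_mappers={1})
manager.report_lost(error)      # shuffle is now incomplete
for index in error.lost_mappers:
    manager.mapper_end(index)   # the execution engine re-runs only these mappers
```

The point of the sketch is that the execution engine needs nothing beyond the set of lost mapper indexes and permission to report those mappers as ended a second time.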

From the proposal (if I understood it correctly), however, it seems that feature 1 is not easy to implement in the current architecture of Celeborn (???):

Celeborn doesn't know which mapper tasks need to be recomputed, unless the mapping of partitionId -> List<mapId> is recorded and reported to LifeCycleManager at committing time.
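As an illustration of the partitionId -> List<mapId> mapping mentioned in the proposal, the following Python sketch records which mappers wrote to each partition and looks up the mappers to recompute when partitions are lost. CommitRecord, record_write, and mappers_to_recompute are hypothetical names, and how Celeborn would actually collect and report this information at commit time is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class CommitRecord:
    """Hypothetical commit-time record: partition id -> ids of mappers that wrote to it."""

    def __init__(self) -> None:
        self._writers: Dict[int, Set[int]] = defaultdict(set)

    def record_write(self, partition_id: int, map_id: int) -> None:
        # Would be reported once per (partition, mapper) pair at commit time.
        self._writers[partition_id].add(map_id)

    def mappers_to_recompute(self, lost_partitions: Iterable[int]) -> List[int]:
        # Union of all mappers that contributed data to any lost partition.
        lost: Set[int] = set()
        for partition_id in lost_partitions:
            lost |= self._writers.get(partition_id, set())
        return sorted(lost)


# Usage: mappers 0 and 2 wrote to partition 5, mapper 1 wrote to partition 6.
record = CommitRecord()
record.record_write(5, 0)
record.record_write(5, 2)
record.record_write(6, 1)
to_rerun = record.mappers_to_recompute([5])
```

With such a record, a read error on a partition can be translated into the mapper task indexes that feature 1 asks for.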

By the way, we have finished the initial implementation of Hive-MR3-Celeborn, and it works very reliably when tested with TPC-DS 10TB, and the performance is also good. A release candidate is currently being tested in production by a third party. It could take a bit of time to learn to use Hive-MR3-Celeborn, but Hive-MR3-Celeborn could be another way to run stress tests on Celeborn. For example, we produced the EOFException error when running stress tests by using speculative execution a lot and intentionally giving heavy memory pressure. (We have quick start guides for Hadoop, K8s, and standalone mode, so it should take no more than a couple of hours to learn to run Hive-MR3-Celeborn.) If you are interested, please let me know. Thank you.


Best,

-- Sungwoo

On Fri, 22 Sep 2023, Erik fang wrote:

Hi folks,

I have a proposal to implement Spark stage resubmission to handle shuffle
fetch failure in Celeborn

https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8

please have a look and let me know what you think

Regards,
Erik
