2. You can try to add the shuffle id in
ShuffleClientImpl's stageEndShuffleSet. Currently
ShuffleClient does not have an API like `setStageEnd`, but I think it's
fine to add one. Let me know if you are interested in sending a PR :)
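
For illustration only, the suggestion above could look roughly like the
sketch below. It is a toy model, not Celeborn code: the class
StageEndSketch, the methods setStageEnd and pushBatches, and the push loop
are all hypothetical, and only the name stageEndShuffleSet is borrowed from
ShuffleClientImpl. It assumes the intent is that a push path skips work for
any shuffle ID already recorded in the set.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: this is NOT Celeborn's ShuffleClientImpl.
public class StageEndSketch {
    // Shuffle IDs whose stage is considered ended (field name borrowed from the thread).
    private final Set<Integer> stageEndShuffleSet = ConcurrentHashMap.newKeySet();

    // Hypothetical API: ShuffleClient currently has no method like this.
    public void setStageEnd(int shuffleId) {
        stageEndShuffleSet.add(shuffleId);
    }

    public boolean isStageEnded(int shuffleId) {
        return stageEndShuffleSet.contains(shuffleId);
    }

    // Hypothetical push loop: stops as soon as the shuffle is marked ended.
    // Returns the number of batches that were actually "pushed".
    public int pushBatches(int shuffleId, int batches) {
        int pushed = 0;
        for (int i = 0; i < batches; i++) {
            if (isStageEnded(shuffleId)) {
                break; // stage ended: no further pushes, so no further revive requests
            }
            pushed++; // stand-in for a real network push
        }
        return pushed;
    }

    public static void main(String[] args) {
        StageEndSketch client = new StageEndSketch();
        System.out.println(client.pushBatches(7, 3)); // prints 3
        client.setStageEnd(7);
        System.out.println(client.pushBatches(7, 3)); // prints 0
    }
}
```

In a real client, the caller (e.g. the code that interrupts a failed DAG)
would mark the shuffle ID before or alongside unregistering it, so that
in-flight sender threads stop on their next check.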

Hi Keyong,

Thanks for the detailed reply. We have decided to use the current API:

1. In our system, workers don't know when a particular stage has ended.
The driver knows when a stage has ended, but it does not notify workers
of the completion of a stage.

2. As you mentioned, sending/receiving Revive requests is not abnormal
behavior. Rather, we consider it part of cleaning up when the execution
of a DAG fails.

Thanks,

--- Sungwoo

BTW, if you are using v0.3.0-incubating, I recommend applying the patch
from the following PR:
https://github.com/apache/incubator-celeborn/pull/1755
It's related to StageEnd logic.

Thanks,
Keyong Zhou



<[email protected]> wrote on Mon, Jul 31, 2023 at 10:54:

Hi Celeborn team,

We are implementing a Celeborn-MR3 client, and have a question on how to
properly unregister a shuffle ID via ShuffleClient. Here is a description
of the problem.

1. Suppose that several ShuffleClients are pushing data for a common
shuffle ID.

2. For some reason (e.g., a Hive query fails due to OutOfMemoryError or
some task fails after several attempts), we decide to interrupt all
ShuffleClients.

3. Inside the driver, we call ShuffleClient.unregisterShuffle() with
isDriver set to true. Inside MR3 workers, we call
ShuffleClient.unregisterShuffle() with isDriver set to false, as well as
ShuffleClient.cleanup().

Outcome:

Inside workers, data push threads continue to run. As a result, the driver
keeps receiving revive requests due to HARD_SPLIT.

Question:

1. Can we stop data push threads (e.g., celeborn-retry-sender-6) when we
call ShuffleClient.unregisterShuffle()?

2. What is the correct way to stop ShuffleClient for a given shuffle ID?
In our experiment, the driver prints thousands of revive requests, and we
are not sure if this is normal behavior.

Any comment or suggestion will be appreciated very much. Thank you.

--- Sungwoo

