Hi Celeborn team,

We are implementing a Celeborn-MR3 client, and have a question on how to
properly unregister a shuffle ID via ShuffleClient. Here is a description of the problem.

1. Suppose that several ShuffleClients are pushing data for a common shuffle ID.

2. For some reason (e.g., a Hive query fails due to OutOfMemoryError or some task fails after several attempts), we decide to interrupt all ShuffleClients.

3. Inside the driver, we call ShuffleClient.unregisterShuffle() with isDriver set to true. Inside MR3 workers, we call ShuffleClient.unregisterShuffle()
with isDriver set to false, followed by ShuffleClient.cleanup().
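
In simplified form, the teardown in step 3 can be sketched as follows. This is only an illustration: ShuffleClientLike is a hypothetical stand-in interface rather than the real ShuffleClient, and the argument lists shown here are assumptions (the actual Celeborn methods may take different parameters). The recording client just logs each call so the order is visible.

```java
import java.util.ArrayList;
import java.util.List;

public class TeardownSketch {
    // Hypothetical stand-in for the two ShuffleClient calls described above.
    // The real Celeborn signatures may differ; these are assumptions.
    interface ShuffleClientLike {
        void unregisterShuffle(int shuffleId, boolean isDriver);
        void cleanup(int shuffleId);
    }

    // Test double that records each call instead of talking to a cluster.
    static class RecordingClient implements ShuffleClientLike {
        private final String name;
        private final List<String> log;

        RecordingClient(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void unregisterShuffle(int shuffleId, boolean isDriver) {
            log.add(name + ":unregisterShuffle(" + shuffleId + ", isDriver=" + isDriver + ")");
        }

        @Override
        public void cleanup(int shuffleId) {
            log.add(name + ":cleanup(" + shuffleId + ")");
        }
    }

    // Driver side: unregister with isDriver set to true.
    static void teardownDriver(ShuffleClientLike client, int shuffleId) {
        client.unregisterShuffle(shuffleId, true);
    }

    // Worker side: unregister with isDriver set to false, then clean up.
    static void teardownWorker(ShuffleClientLike client, int shuffleId) {
        client.unregisterShuffle(shuffleId, false);
        client.cleanup(shuffleId);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        teardownDriver(new RecordingClient("driver", log), 7);
        teardownWorker(new RecordingClient("worker", log), 7);
        log.forEach(System.out::println);
    }
}
```

Nothing in this sequence explicitly stops the push threads, which is the gap our questions below are about.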

Outcome:

Inside the workers, the data push threads continue to run. As a result, the driver
keeps receiving revive requests due to HARD_SPLIT.

Questions:

1. Can we stop data push threads (e.g., celeborn-retry-sender-6) when we call ShuffleClient.unregisterShuffle()?

2. What is the correct way to stop a ShuffleClient for a given shuffle ID? In our experiment, the driver prints thousands of revive requests, and we are not sure whether this is normal behavior.

Any comment or suggestion will be appreciated very much. Thank you.

--- Sungwoo
