Hi Celeborn team,
We are implementing a Celeborn-MR3 client, and have a question on how to
properly unregister a shuffle ID via ShuffleClient. Here is a description
of the problem.
1. Suppose that several ShuffleClients are pushing data for a common
shuffle ID.
2. For some reason (e.g., a Hive query fails due to OutOfMemoryError or
some task fails after several attempts), we decide to interrupt all
ShuffleClients.
3. Inside the driver, we call ShuffleClient.unregisterShuffle() with
isDriver set to true. Inside MR3 workers, we call ShuffleClient.unregisterShuffle()
with isDriver set to false, as well as ShuffleClient.cleanup().
Outcome:
Inside the workers, data push threads continue to run. As a result, the driver
keeps receiving revive requests due to HARD_SPLIT.
Questions:
1. Can we stop data push threads (e.g., celeborn-retry-sender-6) when we
call ShuffleClient.unregisterShuffle()?
2. What is the correct way to stop ShuffleClient for a given shuffle ID?
In our experiment, the driver prints thousands of revive requests, and we
are not sure if this is normal behavior.
Any comment or suggestion will be appreciated very much. Thank you.
--- Sungwoo