The following are the main steps of a shuffle stage:
1. LifecycleManager sends RequestSlots to Master to request slots for the current shuffle;
2. Master allocates slots among workers for the shuffle and returns RequestSlotsResponse;
3. LifecycleManager sends ReserveSlots to workers; workers do initialization;
4. ShuffleClient pushes data to workers;
5. When a map task ends, ShuffleClient sends MapperEnd to LifecycleManager;
6. When all map tasks have ended, LifecycleManager sends CommitFiles to workers;
7. When CommitFiles succeeds, reducer tasks can read data from workers.

Hello,

Is there a way to use the Celeborn API to check whether CommitFiles succeeds in step 6? We are currently testing with 10 TB of TPC-DS data, and one heavy query (query 24) occasionally fails with:

  Caused by: java.io.IOException: Premature EOF from inputStream

We suspect that this error occurs because we are missing the check in step 6, so reducers may start reading before CommitFiles has succeeded.
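
To make the question concrete, here is a rough sketch of the kind of check we have in mind. It is a toy model in plain Java, not Celeborn code: `CommitGate`, `onMapperEnd`, `onCommitFilesResponse`, and `canRead` are hypothetical names made up for illustration, and we are only assuming that steps 5 to 7 behave this way. The idea is that a reducer is allowed to read only after every map task has ended and every worker has reported a successful CommitFiles.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of steps 5-7. All names are hypothetical, not the Celeborn API. */
public class CommitGate {
    private final int numMappers;
    private final Set<Integer> finishedMappers = new HashSet<>();
    private final Set<String> pendingWorkers;
    private boolean commitFailed = false;

    public CommitGate(int numMappers, Set<String> workers) {
        this.numMappers = numMappers;
        this.pendingWorkers = new HashSet<>(workers);
    }

    /** Step 5: a map task reports MapperEnd. Returns true once all map tasks have ended. */
    public synchronized boolean onMapperEnd(int mapId) {
        finishedMappers.add(mapId);
        return finishedMappers.size() == numMappers;
    }

    /** Step 6: record one worker's CommitFiles result. */
    public synchronized void onCommitFilesResponse(String worker, boolean success) {
        if (!success) {
            commitFailed = true;
        }
        pendingWorkers.remove(worker);
    }

    /** Step 7: reading is allowed only if all mappers ended and every worker committed. */
    public synchronized boolean canRead() {
        return finishedMappers.size() == numMappers
                && pendingWorkers.isEmpty()
                && !commitFailed;
    }

    public static void main(String[] args) {
        CommitGate gate = new CommitGate(2, Set.of("worker-1", "worker-2"));
        gate.onMapperEnd(0);
        gate.onMapperEnd(1);
        gate.onCommitFilesResponse("worker-1", true);
        System.out.println(gate.canRead()); // false: worker-2 has not answered yet
        gate.onCommitFilesResponse("worker-2", true);
        System.out.println(gate.canRead()); // true
    }
}
```

In our setup the equivalent of `canRead()` is what we do not know how to obtain from Celeborn, hence the question above.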

Thanks,

--- Sungwoo
