Following are the main steps for a shuffle stage:
1. LifecycleManager sends RequestSlots to Master to request slots for the
current shuffle;
2. Master allocates slots among workers for the shuffle and
returns RequestSlotsResponse;
3. LifecycleManager sends ReserveSlots to workers; workers do
initialization;
4. ShuffleClient pushes data to workers;
5. When a map task ends, ShuffleClient sends MapperEnd to LifecycleManager;
6. When all map tasks have ended, LifecycleManager sends CommitFiles to workers;
7. When CommitFiles succeeds, reducer tasks can read data from workers.
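
To make the ordering concrete, the steps above can be modelled as a small state machine. This is only a toy in-memory sketch: the class, method, and state names below are hypothetical and are not Celeborn's API, and the outcome of CommitFiles is passed in as a plain boolean rather than obtained from real workers. The only point it illustrates is that reads are allowed solely after a successful commit.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the shuffle-stage steps; all names are hypothetical, not Celeborn's API.
class ShuffleStageSketch {
    enum State { SLOTS_ALLOCATED, SLOTS_RESERVED, PUSHING, ALL_MAPPERS_ENDED, COMMITTED, COMMIT_FAILED }

    private final int numMappers;
    private final Set<Integer> endedMappers = new HashSet<>();
    private State state = State.SLOTS_ALLOCATED; // steps 1-2 assumed done at construction

    ShuffleStageSketch(int numMappers) {
        this.numMappers = numMappers;
    }

    // Step 3: workers initialize the reserved slots.
    void reserveSlots() {
        require(state == State.SLOTS_ALLOCATED, "slots must be allocated first");
        state = State.SLOTS_RESERVED;
    }

    // Step 4: a map task pushes data to the workers.
    void pushData(int mapId) {
        require(state == State.SLOTS_RESERVED || state == State.PUSHING, "slots not reserved");
        require(!endedMappers.contains(mapId), "mapper already ended");
        state = State.PUSHING;
    }

    // Step 5: a map task reports MapperEnd; returns true once every mapper has ended.
    boolean mapperEnd(int mapId) {
        require(state == State.SLOTS_RESERVED || state == State.PUSHING, "not in push phase");
        endedMappers.add(mapId);
        if (endedMappers.size() == numMappers) {
            state = State.ALL_MAPPERS_ENDED;
            return true;
        }
        return false;
    }

    // Step 6: CommitFiles; the simulated result from the workers is supplied by the caller.
    void commitFiles(boolean workersSucceeded) {
        require(state == State.ALL_MAPPERS_ENDED, "cannot commit before all mappers end");
        state = workersSucceeded ? State.COMMITTED : State.COMMIT_FAILED;
    }

    // Step 7: reducers may read only after a successful commit.
    boolean canRead() {
        return state == State.COMMITTED;
    }

    private static void require(boolean condition, String message) {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }
}
```

In this model a reducer that calls canRead() before commitFiles(true) gets false, which is the check the question below is about.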
Hello,
Is there a way to use the Celeborn API to check whether CommitFiles succeeds in
step 6? We are currently testing with 10 TB of TPC-DS data, and one heavy
query (query 24) occasionally fails with:
Caused by: java.io.IOException: Premature EOF from inputStream
We suspect that this error occurs because we are missing the check in
step 6.
Thanks,
--- Sungwoo