waitinfuture opened a new pull request, #2235:
URL: https://github.com/apache/incubator-celeborn/pull/2235
### What changes were proposed in this pull request?
I tested 1T TPCDS with the following Celeborn 8-worker cluster setup:
1. Workers have fixed ports for rpc/push/replicate
2. `spark.celeborn.client.spark.fetch.throwsFetchFailure` is enabled
3. graceful shutdown is enabled
I randomly `kill -9` some workers (so that graceful shutdown does not take
effect) and restart them immediately.
I then encountered incorrect results with low probability (1 out of 99
queries).
After digging into it, I found the reason is as follows:
1. At time T1, all workers are serving shuffle 602
2. At time T2, I `kill -9` worker1 and worker2 and restart them. Since the
workers are configured with fixed ports, clients think they are OK, and the
Master lets them re-register, which also succeeds. The restarted workers
have clean in-memory state.
3. At time T3, push requests to worker2 fail and revive on worker1, so
worker1 has a reservation for shuffle 602.
4. At time T4, LifecycleManager sends CommitFiles to all workers. worker1
just logs that some PartitionLocations don't exist and ignores them.
5. CommitFiles succeeds, but worker1 has lost the data it held before
restarting.
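
The sequence above can be modelled with a small toy simulation. This is only a sketch of the failure mode, not Celeborn code: `ToyWorker`, `reserve`, `push`, `kill_and_restart` and `commit_lenient` are hypothetical names, and the location ids `p0`/`p1` are made up for illustration.

```python
class ToyWorker:
    """Hypothetical stand-in for a worker; not Celeborn's Worker class."""

    def __init__(self, name):
        self.name = name
        # shuffle id -> {partition location id: buffered bytes}, in memory only
        self.shuffles = {}

    def reserve(self, shuffle_id, location_id):
        self.shuffles.setdefault(shuffle_id, {})[location_id] = b""

    def push(self, shuffle_id, location_id, data):
        self.shuffles[shuffle_id][location_id] += data

    def kill_and_restart(self):
        # kill -9 bypasses graceful shutdown, so in-memory state is gone.
        self.shuffles = {}

    def commit_lenient(self, shuffle_id, location_ids):
        """Assumed pre-fix behaviour: log unknown locations and ignore them."""
        known = self.shuffles.get(shuffle_id, {})
        committed = {}
        for loc in location_ids:
            if loc in known:
                committed[loc] = known[loc]
            else:
                print(f"{self.name}: location {loc} not found, ignoring")
        return True, committed


# T1: worker1 serves shuffle 602.
worker1 = ToyWorker("worker1")
worker1.reserve(602, "p0")
worker1.push(602, "p0", b"rows-before-restart")

# T2: worker1 is killed and restarted on the same fixed port.
worker1.kill_and_restart()

# T3: a push that failed elsewhere revives onto worker1.
worker1.reserve(602, "p1")
worker1.push(602, "p1", b"rows-after-revive")

# T4: the commit request still lists the pre-restart location p0.
ok, committed = worker1.commit_lenient(602, ["p0", "p1"])
print(ok, sorted(committed))
```

In this model the commit reports success even though the data for `p0` no longer exists, which matches the silent data loss described in the list.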
The following snapshot shows the process.

This PR fixes this by treating unfound PartitionLocations as failed when
handling CommitFiles.
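
As a rough illustration of the idea (not the actual patch), a commit handler can report every requested location it does not know about as failed instead of skipping it. `CommitResult` and `commit_files` below are hypothetical names, and the real CommitFiles handling in the worker is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class CommitResult:
    """Hypothetical result type; not Celeborn's CommitFiles response."""
    committed: list = field(default_factory=list)
    failed: list = field(default_factory=list)

    @property
    def success(self):
        # The commit only counts as successful if nothing failed.
        return not self.failed


def commit_files(known_locations, requested_ids):
    """Commit the requested locations; unfound ones are treated as failed."""
    result = CommitResult()
    for loc in requested_ids:
        if loc in known_locations:
            result.committed.append(loc)
        else:
            # Previously this case was only logged and ignored.
            result.failed.append(loc)
    return result


# A restarted worker knows only the location reserved after the restart.
result = commit_files({"p1": b"rows-after-revive"}, ["p0", "p1"])
print(result.success, result.committed, result.failed)
```

With the unfound location reported as failed, the caller can see that the commit did not fully succeed rather than receiving a success that hides lost data.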
### Why are the changes needed?
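Without this change, CommitFiles can succeed even though a restarted worker
has lost data, which leads to incorrect results (see above).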
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test