venkata91 commented on PR #3147: URL: https://github.com/apache/celeborn/pull/3147#issuecomment-2807072781
@SteNicholas @reswqa Let me add a bit more context on this issue and on the LI environment itself. I understand where you're coming from: for transient issues such as a worker upgrade, recomputing the missing partitions from the upstream vertex is costly. But there are two different cases in which a Celeborn worker (or host) can be down.

1. _Transient_ cases (similar to what you mentioned, e.g. a Celeborn worker software upgrade):
   - Data/metadata is still intact, and I assume these outages are expected to be very short.
2. _Permanent_ / longer maintenance cases (both planned and unplanned):
   - These also result in `Failed to connect to <host>`, but the Celeborn worker (or host) won't be able to recover. We typically have planned maintenance where a host is taken down for software upgrades etc., and there are also unplanned maintenance issues.
   - The client cannot recover even with retries in these cases (note: AFAIK Flink doesn't support `replication` at this point; replication can alleviate this issue further but cannot fully solve it).

For these cases, wouldn't it make sense to throw `PartitionConnectionException` and retry the relevant portion of the upstream tasks? At least this way the job will recover and can complete successfully. Otherwise, every task would fail with the same exception (after retries), eventually exhausting the task failure retries on the Flink side.

I see the trade-off here. We want to minimize the re-computation cost in the case of transient worker loss (possibly by retrying on the client side with exponential backoff and a max timeout) and still improve the fault tolerance of Flink with Celeborn by retrying the portion of upstream tasks in the permanent / longer maintenance window cases.
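To make the trade-off concrete, here is a minimal sketch of the idea: retry the connection with capped exponential backoff up to a total wait budget, and only escalate once that budget is spent. Everything in it is an assumption for illustration and not code from this PR: `BackoffThenEscalate`, `fetchWithBackoff`, and the parameter names are hypothetical, and `PartitionUnreachableException` is a local stand-in for Flink's `PartitionConnectionException`, not the real class.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical stand-in for Flink's PartitionConnectionException (not the real class). */
class PartitionUnreachableException extends IOException {
    PartitionUnreachableException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class BackoffThenEscalate {
    private BackoffThenEscalate() {}

    /**
     * Runs {@code attempt} until it succeeds. After each failure it sleeps with
     * capped exponential backoff. Once the accumulated sleep time reaches
     * {@code maxTotalWaitMs}, the next failure is escalated instead of retried.
     */
    static <T> T fetchWithBackoff(
            Callable<T> attempt, long initialDelayMs, long maxDelayMs, long maxTotalWaitMs)
            throws PartitionUnreachableException, InterruptedException {
        if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs || maxTotalWaitMs < 0) {
            throw new IllegalArgumentException("invalid backoff settings");
        }
        long delayMs = initialDelayMs;
        long waitedMs = 0; // counts time spent sleeping, not wall-clock time
        while (true) {
            try {
                return attempt.call();
            } catch (InterruptedException e) {
                throw e; // never swallow interruption
            } catch (Exception e) {
                if (waitedMs >= maxTotalWaitMs) {
                    // Budget spent: treat the worker as permanently lost so the
                    // caller can trigger recomputation of the upstream tasks.
                    throw new PartitionUnreachableException(
                            "worker still unreachable after " + waitedMs + " ms of backoff", e);
                }
                long sleepMs = Math.min(delayMs, maxTotalWaitMs - waitedMs);
                Thread.sleep(sleepMs);
                waitedMs += sleepMs;
                delayMs = Math.min(delayMs * 2, maxDelayMs); // double, up to the cap
            }
        }
    }
}
```

With a budget sized to a typical worker restart, the transient case is absorbed by the client-side retries, while the permanent case surfaces as a single escalation rather than repeated task failures. The right values for the delays and the budget are an open question in this sketch.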
