venkata91 commented on PR #3147:
URL: https://github.com/apache/celeborn/pull/3147#issuecomment-2807072781

   @SteNicholas @reswqa Let me add a bit more context on this issue, as well 
as on the LI environment itself.
   
   I understand where you're coming from: for transient worker upgrade 
issues, recomputing the missing partitions from the upstream vertex is costly.
   
   But there are 2 different cases where a Celeborn worker (or host) can be 
down.
   1. _Transient_ cases (similar to what you mentioned, Celeborn worker 
software upgrade etc):
   - Data/metadata is still intact, and I assume these outages are expected 
to be very short.
   2. _Permanent_ / longer maintenance cases (both planned and unplanned)
   - These cases also result in `Failed to connect to <host>`, but the 
Celeborn worker (or host) won't be able to recover. We typically have planned 
maintenance where a host is taken down for software upgrades etc., and there 
are also unplanned maintenance issues.
   - The client cannot recover in these cases even with retries (note: AFAIK 
Flink doesn't support `replication` at this point; replication can alleviate 
this issue further but cannot fully solve it). For these cases, wouldn't it 
make sense to throw `PartitionConnectionException` and retry the relevant 
portion of the upstream tasks? At least this way, the job will recover and can 
complete successfully. Otherwise, every task would fail with the same 
exception (after retries), eventually exhausting the task failure retries on 
the Flink side.
   
   I see the trade-off here. We want to minimize the re-computation cost for 
transient worker loss (possibly by retrying on the client side with 
exponential backoff up to a max timeout), while still improving the fault 
tolerance of Flink with Celeborn by retrying the relevant portion of upstream 
tasks in the permanent / longer maintenance window cases.
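
   To make the trade-off concrete, here is a rough sketch of the idea, not 
Celeborn's actual client code. All names in it (`BackoffThenEscalate`, 
`fetchWithBackoff`, `UpstreamRecomputeException`, `Attempt`, `Sleeper`) are 
hypothetical; `UpstreamRecomputeException` only stands in for Flink's 
`PartitionConnectionException`, and the backoff values are made up. It retries 
the connection with exponential backoff until a total wait budget is used up, 
and only then escalates so the upstream tasks can be re-run.

```java
import java.io.IOException;

// Illustrative sketch only: names and defaults here are assumptions,
// not part of the Celeborn or Flink APIs.
final class BackoffThenEscalate {

    /** Hypothetical stand-in for Flink's PartitionConnectionException. */
    static class UpstreamRecomputeException extends IOException {
        UpstreamRecomputeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** One connection/fetch attempt against a worker. */
    interface Attempt<T> {
        T run() throws IOException;
    }

    /** Abstracts Thread.sleep so the policy can be exercised without waiting. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /**
     * Retries {@code attempt} with exponential backoff. Transient outages are
     * absorbed by the retries; once the total wait reaches
     * {@code maxTotalWaitMs}, the failure is treated as permanent and escalated.
     */
    static <T> T fetchWithBackoff(
            Attempt<T> attempt,
            long initialBackoffMs,
            long maxTotalWaitMs,
            Sleeper sleeper) throws IOException, InterruptedException {
        long waitedMs = 0;
        long backoffMs = initialBackoffMs;
        while (true) {
            try {
                return attempt.run();
            } catch (IOException e) {
                if (waitedMs >= maxTotalWaitMs) {
                    // Budget exhausted: signal that upstream tasks should be retried.
                    throw new UpstreamRecomputeException(
                            "worker unreachable after waiting " + waitedMs + " ms", e);
                }
                // Never sleep past the remaining budget.
                long delayMs = Math.min(backoffMs, maxTotalWaitMs - waitedMs);
                sleeper.sleep(delayMs);
                waitedMs += delayMs;
                backoffMs *= 2;
            }
        }
    }
}
```

   In a real client the `Sleeper` would simply be `Thread::sleep`, and the 
wait budget would presumably be a config option sized to cover a typical 
worker restart.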


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
