mridulm commented on PR #3147:
URL: https://github.com/apache/celeborn/pull/3147#issuecomment-2817025176

   @RexXiong, @SteNicholas the scenario being observed is as follows (Venkat 
has given details as well):
   
   When a worker is lost, we currently observe Flink application failures 
because the parent stage is not retried to recompute the lost data. For 
transient failures due to rolling upgrades, etc., we do have `io.maxRetries` and 
`io.retryWait` as knobs to control behavior.
   
   But when the worker is lost and cannot be recovered in a reasonable time, 
then, given that Flink does not support replication, the result is repeated 
retries of the task and, eventually, application failure.
   
   We are wondering whether the change is insufficient and whether there is a 
better way to handle this scenario.
   If not, is the concern that `io.maxRetries` and `io.retryWait` are 
insufficient to handle rolling upgrades?
   
   Given that this was repeatedly observed for fairly expensive Flink batch 
applications, we are trying to find a reliable way to address this issue. 
Thanks!

