rkhachatryan commented on pull request #15846:
URL: https://github.com/apache/flink/pull/15846#issuecomment-839646729


   Thanks for reviewing @akalash 
   
   > I mean it is definitely fine to restart the job if it has a transient 
error on recovery but why we have the same behavior when corruption happened.
   
   Interesting idea. However, I have some concerns:
   1. I don't see how the behavior could differ for corruption vs. transient 
failure cases. Perhaps by failing the job immediately? The benefit would be the saved 
restart attempts, but I doubt that it's worth the extra complexity.
   1. I'm not sure that corruption itself cannot be transient (e.g. if one 
DFS replica was damaged, but subsequent requests reach healthy ones). So 
a retry would still make sense.
   1. Here, we can only validate a small portion of a checkpoint (`_metadata`). 
Most of it is loaded later on the TMs.
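
   To illustrate the second point, here is a minimal sketch of a retry loop that treats a corrupted read like any other transient I/O failure. It is not Flink code: `RecoveryRetrySketch`, `CorruptedMetadataException`, and `loadWithRetries` are hypothetical names, and the loader is assumed to be any function that reads the metadata.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical sketch, not part of Flink.
class RecoveryRetrySketch {

    // Hypothetical marker for a failed integrity check on the loaded bytes.
    static class CorruptedMetadataException extends UncheckedIOException {
        CorruptedMetadataException(String message) {
            super(message, new IOException(message));
        }
    }

    // Calls the loader up to maxAttempts times. Corruption and other I/O
    // failures are handled identically: both are retried, because a later
    // read may be served by a healthy replica.
    static <T> T loadWithRetries(Supplier<T> loader, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UncheckedIOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return loader.get();
            } catch (UncheckedIOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed; surface the last error
    }
}
```

   Distinguishing the two cases would mean adding a separate `catch` branch that rethrows immediately on corruption, which is the extra complexity questioned in the first point.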
   
   WDYT?



