rkhachatryan commented on pull request #15846: URL: https://github.com/apache/flink/pull/15846#issuecomment-839646729
Thanks for reviewing, @akalash.

> I mean it is definitely fine to restart the job if it has a transient error on recovery but why we have the same behavior when corruption happened.

Interesting idea. However, I have some concerns:

1. I don't see how the behavior could differ between the corruption and transient-failure cases. Perhaps by failing the job immediately? The benefit would be saved restart attempts, but I doubt it is worth the extra complexity.
2. I'm not sure that corruption itself cannot be transient (e.g. if one DFS replica is damaged but subsequent requests reach healthy ones), so a retry would still make sense.
3. Here, we can only validate a small portion of a checkpoint (`_metadata`). Most of it is loaded later on the TMs.

WDYT?
