akalash commented on pull request #15846: URL: https://github.com/apache/flink/pull/15846#issuecomment-839892112
> I don't see how the behavior could differ for corruption vs transient failure cases. Probably failing the job immediately? The benefit would be saved restart attempts, but I doubt that it worths extra complexity.

Yes, I thought about failing the job immediately. I agree that it would add extra complexity and that it is out of scope for this task. I was mostly interested in whether this had already been discussed.

> I'm not sure whether corruption itself can not be transient (e.g. if one DFS replica was damaged, but subsequent requests arrive to healthy ones). So retry would still make sense.

It is an interesting point that corruption cannot be detected with certainty. Of course, if there is no mechanism to detect corruption (CRC or similar), then it doesn't make sense to try to guess whether a failure is transient or not. (But perhaps the deserializer can help with that: if we have enough bytes but are not able to deserialize them, I don't think that can be a transient problem.)

Again, I am quite satisfied with this task. For now, I agree that it perhaps doesn't make sense to distinguish between corruption and a transient problem. But I will think about it separately from this PR.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
