[GitHub] [flink] akalash commented on pull request #15846: [FLINK-22502][checkpointing] Don't tolerate checkpoint retrieval failures on recovery

GitBox Wed, 12 May 2021 09:01:35 -0700


akalash commented on pull request #15846:
URL: https://github.com/apache/flink/pull/15846#issuecomment-839892112



   > I don't see how the behavior could differ for corruption vs transient 
failure cases. Probably failing the job immediately? The benefit would be saved 
restart attempts, but I doubt that it is worth the extra complexity.
   
   Yes, I thought about failing the job immediately. I agree that it would add 
extra complexity and is out of scope for this task. I was mostly interested in 
whether it had already been discussed.
   
   > I'm not sure whether corruption itself cannot be transient (e.g. if one 
DFS replica was damaged, but subsequent requests arrive at healthy ones). So 
retry would still make sense.
   
   It is an interesting point that corruption cannot be detected for sure. Of 
course, if there is no mechanism to detect corruption (CRC or similar), then it 
doesn't make sense to try to guess whether a failure is transient or not. (But 
perhaps the deserializer can help with that: if we have read enough bytes but 
are not able to deserialize them, I don't think it can be a transient problem.)
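
To make the CRC idea concrete, here is a minimal sketch of how a retrieved 
checkpoint blob could be classified. It is not Flink code: the class 
`RetrievalCheckSketch`, the `Outcome` enum, and the `expectedLength` / 
`expectedCrc` metadata are all hypothetical names invented for illustration, 
and it assumes a length and checksum were recorded when the blob was written. 
Note that a checksum mismatch only says that this particular read returned 
damaged bytes; as pointed out above, a later read could still hit a healthy 
replica, so a caller may still choose to retry.

```java
import java.util.zip.CRC32;

// Hypothetical sketch, not part of Flink: classifies the result of one read attempt.
class RetrievalCheckSketch {

    enum Outcome { OK, SHORT_READ, CHECKSUM_MISMATCH }

    // Computes a CRC32 over the whole byte array.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // expectedLength and expectedCrc are assumed to have been stored alongside
    // the checkpoint handle at write time (an assumption of this sketch).
    static Outcome classify(byte[] bytes, int expectedLength, long expectedCrc) {
        if (bytes == null || bytes.length < expectedLength) {
            // Fewer bytes than expected: plausibly a transient I/O problem.
            return Outcome.SHORT_READ;
        }
        if (crcOf(bytes) != expectedCrc) {
            // Enough bytes, but the content of this read is damaged.
            return Outcome.CHECKSUM_MISMATCH;
        }
        return Outcome.OK;
    }
}
```

With something like this, only `CHECKSUM_MISMATCH` would be a candidate for 
failing the job early, while `SHORT_READ` would go through the normal restart 
path.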
   
   
   Again, I am quite satisfied with this task. For now, I agree that perhaps 
it doesn't make sense to distinguish between corruption and a transient 
problem. But I will think about it separately from this PR.


