weiqingy commented on PR #828:
URL: https://github.com/apache/flink-agents/pull/828#issuecomment-4677267875

   > LGTM.
   > 
   > I tried to add an e2e case to truly verify that recovery from a checkpoint 
fails before the fix and is resolved after it. However, in the MiniCluster, 
triggering an in-place recovery by throwing an exception does not recreate the 
JVM, so the problematic code path is never exercised. I think we can, after 
#708 merge, add a case that manually kills the TM process in a standalone 
cluster to trigger recovery, and use that to verify this fix.
   
   @wenjin272 Thanks for digging into the e2e angle. That matches what we found 
while scoping this — local/MiniCluster mode never crosses Pemja, so the SIGSEGV 
path can't be reproduced there, which is why the unit tests assert the stored 
form is recursively primitive as a checkpoint-safety proxy instead of driving a 
real restore. Killing the TM process in a standalone cluster after #708 is 
exactly the right way to get true before/after verification. I filed #836 to 
track it so it doesn't get lost.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to