weiqingy commented on PR #828:
URL: https://github.com/apache/flink-agents/pull/828#issuecomment-4677267875
> LGTM.
>
> I tried to add an e2e case to truly verify that recovery from a checkpoint fails before the fix and is resolved after it. However, in the MiniCluster, triggering an in-place recovery by throwing an exception does not recreate the JVM, so the problematic code path is never exercised. I think we can, after #708 merge, add a case that manually kills the TM process in a standalone cluster to trigger recovery, and use that to verify this fix.

@wenjin272 Thanks for digging into the e2e angle. That matches what we found while scoping this: local/MiniCluster mode never crosses Pemja, so the SIGSEGV path can't be reproduced there. That is why the unit tests assert that the stored form is recursively primitive, as a checkpoint-safety proxy, instead of driving a real restore.

Killing the TM process in a standalone cluster after #708 is exactly the right way to get true before/after verification. I filed #836 to track it so it doesn't get lost.
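For anyone reading along, here is a minimal sketch of what a "recursively primitive" assertion could look like. The helper name `is_recursively_primitive` and the set of allowed leaf types are assumptions made for illustration; they are not taken from the code in this PR.

```python
from typing import Any

# Leaf types assumed to be safe to keep in checkpointed state.
# This list is an illustrative assumption, not the project's actual rule.
_PRIMITIVE_TYPES = (str, int, float, bool, bytes, type(None))


def is_recursively_primitive(value: Any) -> bool:
    """Return True if value consists only of primitive leaves nested in
    plain lists, tuples, and str-keyed dicts (hypothetical helper)."""
    # Exact type checks on purpose: a subclass could carry extra state
    # that is not plain data.
    if type(value) in _PRIMITIVE_TYPES:
        return True
    if type(value) in (list, tuple):
        return all(is_recursively_primitive(item) for item in value)
    if type(value) is dict:
        return all(
            type(key) is str and is_recursively_primitive(item)
            for key, item in value.items()
        )
    return False
```

A unit test would then assert `is_recursively_primitive(stored_value)` on whatever is written to state. Such a check only shows that no live object wrappers are stored; it does not exercise an actual restore.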
