nateab opened a new pull request, #27579:
URL: https://github.com/apache/flink/pull/27579

   …eManager during JM failover
   
   ## What is the purpose of the change
   
   When a JobManager restarts (common on Kubernetes with spot instances or node 
drains), re-uploaded JARs produce `PermanentBlobKey` objects whose random 
components differ even though the content is identical. The TaskManager's cached 
classloader still holds the old blob keys, so 
`BlobLibraryCacheManager.verifyClassLoader()` throws `IllegalStateException` when 
the new keys do not match. Because every subsequent task deployment attempt hits 
the same mismatch, the job enters an infinite restart loop.
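
To illustrate why identical content can still yield unequal keys, here is a minimal sketch. `BlobKeySketch` is a hypothetical stand-in, not Flink's `PermanentBlobKey`; it only assumes that a key combines a content hash with a random component, as described above. The hash algorithm and the use of `UUID` are illustrative choices.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Objects;
import java.util.UUID;

/** Hypothetical stand-in for a blob key: a content hash plus a random component. */
final class BlobKeySketch {
    private final byte[] contentHash;
    private final UUID random;

    private BlobKeySketch(byte[] contentHash, UUID random) {
        this.contentHash = contentHash;
        this.random = random;
    }

    /** Simulates one upload: the hash depends on the content, the random part does not. */
    static BlobKeySketch forUpload(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return new BlobKeySketch(digest.digest(content), UUID.randomUUID());
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True when both keys were derived from byte-identical content. */
    boolean hasSameContentAs(BlobKeySketch other) {
        return Arrays.equals(contentHash, other.contentHash);
    }

    /** Full equality also compares the random component, so re-uploads differ. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BlobKeySketch)) {
            return false;
        }
        BlobKeySketch other = (BlobKeySketch) o;
        return Arrays.equals(contentHash, other.contentHash) && random.equals(other.random);
    }

    @Override
    public int hashCode() {
        return Objects.hash(Arrays.hashCode(contentHash), random);
    }
}
```

Calling `BlobKeySketch.forUpload(...)` twice with the same bytes gives two keys for which `hasSameContentAs` is true while `equals` is false, which is the situation a key-equality check on the TaskManager side would reject.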
   
   This pull request makes `BlobLibraryCacheManager` handle blob key mismatches 
gracefully by catching the `IllegalStateException` and re-creating the 
classloader with the new blob keys, instead of propagating the exception.
   
   ## Brief change log
   
     - `LibraryCacheEntry.getOrResolveClassLoader()` now catches 
`IllegalStateException` from `verifyClassLoader()` and re-creates the 
classloader with the new blob keys
     - Extracted `createResolvedClassLoader()` helper to eliminate duplication 
between initial creation and re-creation paths
     - The old classloader is intentionally not closed because in-flight tasks 
being cancelled may still reference it
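
The change log above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `LibraryCacheEntrySketch` and `TrackingLoader` are hypothetical names, blob keys are modelled as plain strings, and the classloader is an empty `URLClassLoader`. Only the method names `getOrResolveClassLoader`, `verifyClassLoader`, and `createResolvedClassLoader` are taken from the description.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified cache entry; blob keys are modelled as strings. */
final class LibraryCacheEntrySketch {

    /** Classloader that records whether close() was called (for illustration only). */
    static final class TrackingLoader extends URLClassLoader {
        volatile boolean closed;

        TrackingLoader() {
            super(new URL[0], LibraryCacheEntrySketch.class.getClassLoader());
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    private TrackingLoader loader;
    private Set<String> keys;

    /** Returns the cached loader, re-creating it when the requested keys differ. */
    synchronized TrackingLoader getOrResolveClassLoader(List<String> requestedKeys) {
        if (loader == null) {
            createResolvedClassLoader(requestedKeys);
            return loader;
        }
        try {
            verifyClassLoader(requestedKeys);
        } catch (IllegalStateException e) {
            // Key mismatch: build a fresh loader for the new keys. The old loader is
            // deliberately left open because tasks being cancelled may still use it.
            createResolvedClassLoader(requestedKeys);
        }
        return loader;
    }

    /** Throws if the requested keys do not match the keys the loader was built from. */
    private void verifyClassLoader(List<String> requestedKeys) {
        if (!keys.equals(new HashSet<>(requestedKeys))) {
            throw new IllegalStateException("Requested blob keys differ from cached keys");
        }
    }

    /** Shared by the initial creation path and the re-creation path. */
    private void createResolvedClassLoader(List<String> requestedKeys) {
        keys = new HashSet<>(requestedKeys);
        loader = new TrackingLoader();
    }
}
```

In this sketch, a second request with the same keys returns the cached loader, while a request with different keys returns a new loader and leaves the previous one open. Who eventually releases the old loader is outside the scope of the sketch.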
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added `classloaderIsRecreatedWhenBlobKeysChangeForSameJob`: uploads 
identical content twice to produce different `PermanentBlobKey`s (simulating JM 
failover), verifies the classloader is transparently re-created
     - Added `classloaderRecreationDoesNotCloseOldClassloader`: verifies the 
old classloader is not closed during re-creation, since in-flight tasks may 
still reference it
     - Updated existing tests in `BlobLibraryCacheManagerTest` that previously 
expected `IllegalStateException` to verify the new re-creation behavior
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   

