nateab opened a new pull request, #27579:
URL: https://github.com/apache/flink/pull/27579
…eManager during JM failover
## What is the purpose of the change
When a JobManager restarts (common on Kubernetes with spot instances or node
drains), re-uploaded JARs produce `PermanentBlobKey` objects with different
random components despite identical content. The TaskManager's cached
classloader still holds the old blob keys, so
`BlobLibraryCacheManager.verifyClassLoader()` throws `IllegalStateException`
when the new keys don't match the old ones, even though the JAR content is
identical. The result is an infinite restart loop, because every task
deployment attempt hits the same mismatch.
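To illustrate the mismatch described above, here is a minimal, self-contained sketch. It does not use Flink's classes: `SketchKey` is a hypothetical stand-in that, as an assumption for illustration, pairs a SHA-1 content hash with a per-upload random `UUID`. It only shows why two uploads of identical bytes can yield keys that compare as unequal.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.UUID;

public class BlobKeySketch {

    /** Hypothetical stand-in for a blob key: content hash plus a random part chosen per upload. */
    static final class SketchKey {
        final byte[] contentHash;
        final UUID randomPart;

        SketchKey(byte[] content) {
            this.contentHash = sha1(content);
            this.randomPart = UUID.randomUUID(); // differs on every "upload"
        }

        /** True when both keys were derived from the same bytes. */
        boolean sameContent(SketchKey other) {
            return Arrays.equals(contentHash, other.contentHash);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SketchKey)) {
                return false;
            }
            SketchKey other = (SketchKey) o;
            // Full equality also compares the random part, so re-uploads never match.
            return sameContent(other) && randomPart.equals(other.randomPart);
        }

        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(contentHash) + randomPart.hashCode();
        }
    }

    static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = "identical jar bytes".getBytes(StandardCharsets.UTF_8);
        SketchKey beforeFailover = new SketchKey(jar);
        SketchKey afterFailover = new SketchKey(jar); // simulated re-upload after restart
        System.out.println("same content: " + beforeFailover.sameContent(afterFailover));
        System.out.println("equal keys:   " + beforeFailover.equals(afterFailover));
    }
}
```

A verification step that compares whole keys, as in this sketch, rejects the re-uploaded JAR even though its content hash is unchanged.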
This pull request makes `BlobLibraryCacheManager` handle blob key mismatches
gracefully by catching the `IllegalStateException` and re-creating the
classloader with the new blob keys, instead of propagating the exception.
## Brief change log
- `LibraryCacheEntry.getOrResolveClassLoader()` now catches
`IllegalStateException` from `verifyClassLoader()` and re-creates the
classloader with the new blob keys
- Extracted `createResolvedClassLoader()` helper to eliminate duplication
between initial creation and re-creation paths
- The old classloader is intentionally not closed because in-flight tasks
being cancelled may still reference it
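The change log above can be sketched roughly as follows. This is not the code in the pull request: `CacheEntrySketch` is a hypothetical class, blob keys are simplified to strings, and the classloader is an empty `URLClassLoader`. Only the method names `getOrResolveClassLoader`, `verifyClassLoader`, and `createResolvedClassLoader` are taken from the description; their signatures here are assumptions.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashSet;
import java.util.Set;

public class CacheEntrySketch {

    private Set<String> resolvedKeys;           // keys the cached classloader was built from
    private URLClassLoader resolvedClassLoader; // null until the first resolution

    public synchronized ClassLoader getOrResolveClassLoader(Set<String> requestedKeys) {
        if (resolvedClassLoader == null) {
            return createResolvedClassLoader(requestedKeys); // initial creation path
        }
        try {
            verifyClassLoader(requestedKeys);
            return resolvedClassLoader;
        } catch (IllegalStateException mismatch) {
            // Re-creation path: the previous loader is deliberately NOT closed,
            // because tasks that are still being cancelled may hold a reference to it.
            return createResolvedClassLoader(requestedKeys);
        }
    }

    private void verifyClassLoader(Set<String> requestedKeys) {
        if (!resolvedKeys.equals(requestedKeys)) {
            throw new IllegalStateException(
                    "Requested keys " + requestedKeys + " do not match " + resolvedKeys);
        }
    }

    /** Shared helper so initial creation and re-creation use the same code. */
    private ClassLoader createResolvedClassLoader(Set<String> keys) {
        resolvedKeys = new HashSet<>(keys);
        resolvedClassLoader =
                new URLClassLoader(new URL[0], CacheEntrySketch.class.getClassLoader());
        return resolvedClassLoader;
    }

    public static void main(String[] args) {
        CacheEntrySketch entry = new CacheEntrySketch();
        ClassLoader first = entry.getOrResolveClassLoader(Set.of("key-before-failover"));
        ClassLoader reused = entry.getOrResolveClassLoader(Set.of("key-before-failover"));
        ClassLoader recreated = entry.getOrResolveClassLoader(Set.of("key-after-failover"));
        System.out.println("reused on match:       " + (first == reused));
        System.out.println("recreated on mismatch: " + (first != recreated));
    }
}
```

Leaving the old loader open trades a possible temporary resource leak for safety of in-flight tasks; the sketch makes no attempt to reclaim it later.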
## Verifying this change
This change added tests and can be verified as follows:
- Added `classloaderIsRecreatedWhenBlobKeysChangeForSameJob`: uploads
identical content twice to produce different `PermanentBlobKey`s (simulating JM
failover), verifies the classloader is transparently re-created
- Added `classloaderRecreationDoesNotCloseOldClassloader`: verifies the
old classloader is not closed during re-creation, since in-flight tasks may
still reference it
- Updated existing tests in `BlobLibraryCacheManagerTest` that previously
expected `IllegalStateException` to verify the new re-creation behavior
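One way to check the "old classloader is not closed" property is to record calls to `close()`. The sketch below is an assumption about how such a check could look, not the test added in this pull request: `TrackingClassLoader` and `Holder` are hypothetical names, and `Holder.recreate()` merely stands in for the re-creation path.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseTrackingSketch {

    /** Hypothetical classloader that remembers whether close() was called. */
    static final class TrackingClassLoader extends URLClassLoader {
        private final AtomicBoolean closed = new AtomicBoolean(false);

        TrackingClassLoader() {
            super(new URL[0], CloseTrackingSketch.class.getClassLoader());
        }

        @Override
        public void close() throws IOException {
            closed.set(true);
            super.close();
        }

        boolean isClosed() {
            return closed.get();
        }
    }

    /** Hypothetical holder that swaps in a new loader without closing the previous one. */
    static final class Holder {
        private TrackingClassLoader current = new TrackingClassLoader();

        TrackingClassLoader current() {
            return current;
        }

        TrackingClassLoader recreate() {
            current = new TrackingClassLoader(); // old loader is left open on purpose
            return current;
        }
    }

    public static void main(String[] args) {
        Holder holder = new Holder();
        TrackingClassLoader old = holder.current();
        TrackingClassLoader fresh = holder.recreate();
        if (old == fresh) {
            throw new AssertionError("expected a new classloader instance");
        }
        if (old.isClosed()) {
            throw new AssertionError("old classloader must stay open");
        }
        System.out.println("old loader still open: " + !old.isClosed());
    }
}
```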
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable