nateab opened a new pull request, #27516:
URL: https://github.com/apache/flink/pull/27516
## What is the purpose of the change
This pull request adds a secondary HashMap index
(taskExecutorsByInstanceId) to the ResourceManager for O(1) lookups of
WorkerRegistration by InstanceID. Previously, getWorkerByInstanceId()
performed an O(n) linear scan through all registered TaskExecutors, which
could become a performance bottleneck in clusters with many TaskExecutors.
This addresses the TODO comment in the existing code: "Improve performance
by having an index on the instanceId".
## Brief change log
- Added a taskExecutorsByInstanceId HashMap field to ResourceManager for
  fast InstanceID lookups
- Initialized the index in the ResourceManager constructor
- Kept the index consistent by updating it alongside the primary
  taskExecutors map:
  - Add an entry when a TaskExecutor registers
  - Remove the old entry when a TaskExecutor re-registers (replacement)
  - Remove the entry when a TaskExecutor connection is closed
  - Clear the index when the ResourceManager state is cleared
- Replaced the O(n) loop in getWorkerByInstanceId() with an O(1) HashMap lookup
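
The index maintenance described above can be sketched roughly as follows. This is a simplified, standalone illustration, not the actual ResourceManager code: the class `WorkerIndexSketch`, the `Registration` type, the use of plain strings for IDs, and the method names `register`, `disconnect`, `clearState`, and `scanByInstanceId` are all hypothetical. Only `taskExecutors`, `taskExecutorsByInstanceId`, and `getWorkerByInstanceId` are names taken from this description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical, simplified stand-in for the bookkeeping; not Flink's ResourceManager. */
class WorkerIndexSketch {

    /** Stand-in for WorkerRegistration; IDs are plain strings for illustration only. */
    static final class Registration {
        final String resourceId;
        final String instanceId;

        Registration(String resourceId, String instanceId) {
            this.resourceId = resourceId;
            this.instanceId = instanceId;
        }
    }

    // Primary map, keyed by resource ID.
    private final Map<String, Registration> taskExecutors = new HashMap<>();

    // Secondary index, keyed by instance ID; must be updated with every primary-map change.
    private final Map<String, Registration> taskExecutorsByInstanceId = new HashMap<>();

    /** Registers a TaskExecutor; a re-registration replaces the old entry in both maps. */
    void register(String resourceId, String instanceId) {
        Registration fresh = new Registration(resourceId, instanceId);
        Registration old = taskExecutors.put(resourceId, fresh);
        if (old != null) {
            // Drop the stale instance ID so it can no longer be resolved.
            taskExecutorsByInstanceId.remove(old.instanceId);
        }
        taskExecutorsByInstanceId.put(instanceId, fresh);
    }

    /** Removes a TaskExecutor from both maps when its connection is closed. */
    void disconnect(String resourceId) {
        Registration removed = taskExecutors.remove(resourceId);
        if (removed != null) {
            taskExecutorsByInstanceId.remove(removed.instanceId);
        }
    }

    /** Clears both maps together. */
    void clearState() {
        taskExecutors.clear();
        taskExecutorsByInstanceId.clear();
    }

    /** Expected O(1) lookup through the secondary index. */
    Optional<Registration> getWorkerByInstanceId(String instanceId) {
        return Optional.ofNullable(taskExecutorsByInstanceId.get(instanceId));
    }

    /** O(n) linear scan over the primary map, shown only for comparison. */
    Optional<Registration> scanByInstanceId(String instanceId) {
        for (Registration r : taskExecutors.values()) {
            if (r.instanceId.equals(instanceId)) {
                return Optional.of(r);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch, both lookups should agree after any sequence of `register`, `disconnect`, and `clearState` calls. That invariant is what the existing registration and disconnection tests would be expected to exercise indirectly.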
## Verifying this change
This change is already covered by existing tests, such as:
- ResourceManagerTaskExecutorTest (6 tests) - covers TaskExecutor
registration, re-registration, and disconnection scenarios
- ResourceManagerTest (14 tests) - covers general ResourceManager
functionality
- ActiveResourceManagerTest (18 tests) - covers the releaseResource() path
which is the primary caller of getWorkerByInstanceId()
All 75 ResourceManager-related tests pass with the changes.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes - ResourceManager
is involved in TaskExecutor registration and resource management,
but this is an internal optimization that does not change behavior
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable