GitHub user ymost opened a pull request:
https://github.com/apache/flink/pull/4468
[FLINK-7347] [streaming] Keep ids for current checkpoint in a set instead
of a list
JIRA issue: FLINK-7347
## What is the purpose of the change
This pull request changes the data structure used to store
acknowledge-pending message ids from an `ArrayList` to a `HashSet`. This is
done to eliminate an extremely inefficient call to the `removeAll` method in
`MessageAcknowledgingSourceBase.notifyCheckpointComplete`.
The implementation of `removeAll` is such that if the set is no larger than
the collection to remove, then the set is iterated and every item is checked
for containment in the collection. The `contains` action on an `ArrayList` is
a linear scan, and it is performed for every item in the set, so the call
takes time proportional to the product of the two sizes.
In our pipeline we had about 10 million events processed, and the
checkpoint was stuck on the `removeAll` call for hours.
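The behavior described above can be reproduced outside Flink. The following is a minimal sketch, not code from this pull request: `RemoveAllDemo`, `CountingList`, and `range` are hypothetical names introduced for illustration, and the size `n` is arbitrary. It assumes a JDK where `HashSet` inherits `removeAll` from `AbstractSet`, as is the case in the OpenJDK versions I am aware of.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RemoveAllDemo {

    /** An ArrayList that counts calls to contains(); each such call is a linear scan. */
    static final class CountingList<T> extends ArrayList<T> {
        private static final long serialVersionUID = 1L;
        int containsCalls = 0;

        CountingList(Collection<? extends T> items) {
            super(items);
        }

        @Override
        public boolean contains(Object o) {
            containsCalls++;
            return super.contains(o);
        }
    }

    /** Returns the ids 0..n-1 in ascending order. */
    static List<Integer> range(int n) {
        List<Integer> ids = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            ids.add(i);
        }
        return ids;
    }

    public static void main(String[] args) {
        int n = 20_000; // arbitrary size, small enough to finish quickly

        // Pending set no larger than the argument: the set is iterated and
        // the list's contains() runs once per element of the set.
        Set<Integer> pending = new HashSet<>(range(n));
        CountingList<Integer> idsAsList = new CountingList<>(range(n));
        long start = System.nanoTime();
        pending.removeAll(idsAsList);
        long listNanos = System.nanoTime() - start;

        // Same removal with the ids held in a HashSet: each contains() is a hash lookup.
        pending = new HashSet<>(range(n));
        Set<Integer> idsAsSet = new HashSet<>(range(n));
        start = System.nanoTime();
        pending.removeAll(idsAsSet);
        long setNanos = System.nanoTime() - start;

        System.out.printf("ArrayList argument: %d contains() calls, %d ms%n",
                idsAsList.containsCalls, listNanos / 1_000_000);
        System.out.printf("HashSet argument: %d ms%n", setNanos / 1_000_000);
    }
}
```

Timings depend on the machine and JVM, so the sketch prints them rather than asserting them. The count of `contains()` calls is the more reliable signal of which branch of `removeAll` was taken.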
## Brief change log
Keep ids for current checkpoint in a set instead of a list
## Verifying this change
This change is already covered by existing tests, such as `RMQSourceTest`.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: **no**
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **yes** -
performance should be improved
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: **yes** - this should
resolve the problem where checkpoints would get stuck on the call to `removeAll`
## Documentation
- Does this pull request introduce a new feature? **no**
- If yes, how is the feature documented? **not applicable**
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ymost/flink FLINK-7347-checkpoint-ids-set
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4468.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4468