GitHub user ymost opened a pull request:

    https://github.com/apache/flink/pull/4468

    [FLINK-7347] [streaming] Keep ids for current checkpoint in a set instead 
of a list

    JIRA issue: FLINK-7347
    
    ## What is the purpose of the change
    
    This pull request changes the data structure used to store 
acknowledge-pending message ids from an `ArrayList` to a `HashSet`. This 
eliminates an extremely inefficient call to the `removeAll` method in 
`MessageAcknowledgingSourceBase.notifyCheckpointComplete`.
    The implementation of `removeAll` is such that if the set is no larger than 
the collection to remove, the set is iterated and every item is checked 
for containment in the collection. The `contains` operation on an `ArrayList` is 
a linear scan, and it is performed for every item in the set, so the cost grows 
with the product of the two sizes.
    In our pipeline about 10 million events were processed, and the 
checkpoint was stuck on the `removeAll` call for hours.
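
    To illustrate the `removeAll` behaviour described above, here is a small 
standalone sketch (not code from this pull request). `RemoveAllDemo`, 
`CountingList` and `containsCalls` are hypothetical names made up for the 
example. It counts how often `Set.removeAll` calls `contains` on an `ArrayList` 
argument, and it assumes `HashSet` inherits `removeAll` from `AbstractSet`, as 
it does on Java 8; other JDK versions may behave differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;

public class RemoveAllDemo {

    /** An ArrayList that counts how often contains() is called on it. */
    static class CountingList<T> extends ArrayList<T> {
        int calls = 0;

        @Override
        public boolean contains(Object o) {
            calls++;
            return super.contains(o); // linear scan over the list
        }
    }

    /**
     * Builds a HashSet with setSize ids and a list with listSize ids, then
     * returns how many contains() calls set.removeAll(list) made on the list.
     */
    public static int containsCalls(int setSize, int listSize) {
        Set<Integer> pending = new HashSet<>();
        for (int i = 0; i < setSize; i++) {
            pending.add(i);
        }
        CountingList<Integer> toRemove = new CountingList<>();
        for (int i = 0; i < listSize; i++) {
            toRemove.add(i);
        }
        pending.removeAll(toRemove);
        return toRemove.calls;
    }

    public static void main(String[] args) {
        // Set no larger than the list: the set is iterated and
        // list.contains() is called once per set element.
        System.out.println("set=5,  list=5 -> " + containsCalls(5, 5));
        // Set larger than the list: the list is iterated and
        // set.remove() is called instead, so contains() is never used.
        System.out.println("set=10, list=3 -> " + containsCalls(10, 3));
    }
}
```

    Each of those `contains` calls scans the list, so with n ids on both sides 
the work grows roughly as n squared. Passing a `HashSet` instead makes each 
lookup expected constant time.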
    
    
    ## Brief change log
    
    Keep ids for current checkpoint in a set instead of a list
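
    As a rough sketch of the idea, and not the actual Flink code, the change 
amounts to buffering the ids of the current checkpoint in a `HashSet`. The class 
and member names below (`AckTrackerSketch`, `currentCheckpointIds`, 
`unackedIds`, `idsByCheckpoint`) are hypothetical, and the sketch acknowledges 
only a single checkpoint per notification for brevity.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AckTrackerSketch {

    // Ids seen since the last snapshot. A HashSet here (rather than a list)
    // keeps the later removeAll() call cheap.
    private Set<String> currentCheckpointIds = new HashSet<>();

    // All ids that were processed but not yet acknowledged.
    private final Set<String> unackedIds = new HashSet<>();

    // Ids captured by each snapshot, keyed by checkpoint id.
    private final Map<Long, Set<String>> idsByCheckpoint = new HashMap<>();

    /** Records an id; returns false if it is already pending (a duplicate). */
    public boolean addId(String id) {
        if (unackedIds.add(id)) {
            currentCheckpointIds.add(id);
            return true;
        }
        return false;
    }

    /** Hands the ids collected so far over to the given checkpoint. */
    public void snapshot(long checkpointId) {
        idsByCheckpoint.put(checkpointId, currentCheckpointIds);
        currentCheckpointIds = new HashSet<>();
    }

    /** Returns the ids to acknowledge and drops them from the pending set. */
    public Set<String> notifyCheckpointComplete(long checkpointId) {
        Set<String> acked = idsByCheckpoint.remove(checkpointId);
        if (acked == null) {
            return Collections.emptySet();
        }
        // Whichever side removeAll() iterates, lookups are now hash-based.
        unackedIds.removeAll(acked);
        return acked;
    }

    public int unackedCount() {
        return unackedIds.size();
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.addId("a");
        tracker.addId("b");
        tracker.snapshot(1L);
        tracker.addId("c");
        System.out.println("acked: " + tracker.notifyCheckpointComplete(1L));
        System.out.println("still pending: " + tracker.unackedCount());
    }
}
```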
    
    
    ## Verifying this change
    
    This change is already covered by existing tests, such as `RMQSourceTest`.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): **no**
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
      - The serializers: **no**
      - The runtime per-record code paths (performance sensitive): **yes** - 
performance should be improved
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **yes** - this should 
resolve the problem where checkpoints would get stuck on the call to `removeAll`
    
    ## Documentation
    
      - Does this pull request introduce a new feature? **no**
      - If yes, how is the feature documented? **not applicable**
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ymost/flink FLINK-7347-checkpoint-ids-set

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4468
    