GitHub user ijokarumawak opened a pull request:

    https://github.com/apache/nifi/pull/3116

    NIFI-4715: ListS3 produces duplicates in frequently updated buckets

    This PR is based on #2361. To preserve @adamlamar's credit, please do not 
squash the first commit when merging. Thanks!
    
    The 2nd commit avoids updating `currentKeys` during the listing loop. 
Before this fix, duplicate listings were easy to reproduce even with a small 
number of objects. E.g., when 10 objects were uploaded to S3 at the same time, 
ListS3 could produce 27 FlowFiles. Using min age doesn't address the issue.
    
    Please use [the template file attached to the JIRA](https://issues.apache.org/jira/secure/attachment/12946341/ListS3_Duplication.xml) to reproduce.
    
    After applying this fix, I confirmed that ListS3 produces FlowFiles without 
duplication. I tested that 10,000 objects were listed without duplication while 
they were being uploaded by PutS3 and listed by ListS3 simultaneously.
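    
    The ordering problem behind the duplicates can be illustrated with a small 
model. This is only a sketch under assumptions, not the actual ListS3 code: 
the class `ListingDedupSketch`, its `list` and `secondRun` methods, and the 
`updateDuringLoop` flag are hypothetical names, and the bucket is modelled as 
a sorted map from key to last-modified time. Only `currentKeys` and the idea 
of a tracked timestamp come from the description above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical, simplified model of the listing state; not the actual ListS3 code.
class ListingDedupSketch {
    long currentTimestamp = 0L;                 // newest lastModified seen by earlier runs
    Set<String> currentKeys = new HashSet<>();  // keys already listed at currentTimestamp

    // Walks the bucket in key (lexicographic) order and returns the keys emitted by this run.
    List<String> list(SortedMap<String, Long> bucket, boolean updateDuringLoop) {
        List<String> emitted = new ArrayList<>();
        long maxTimestamp = currentTimestamp;
        Set<String> keysAtMax = new HashSet<>();
        for (Map.Entry<String, Long> entry : bucket.entrySet()) {
            String key = entry.getKey();
            long lastModified = entry.getValue();
            boolean alreadyListed = lastModified < currentTimestamp
                    || (lastModified == currentTimestamp && currentKeys.contains(key));
            if (alreadyListed) {
                continue;
            }
            emitted.add(key);
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                keysAtMax.clear();
                if (updateDuringLoop) {
                    currentKeys.clear(); // old behavior: forgets keys still needed for later entries
                }
            }
            if (lastModified == maxTimestamp) {
                keysAtMax.add(key);
                if (updateDuringLoop) {
                    currentKeys.add(key);
                }
            }
        }
        if (!updateDuringLoop) {
            // Fixed behavior: touch currentKeys only after the whole listing has been compared.
            if (maxTimestamp > currentTimestamp) {
                currentKeys = keysAtMax;
            } else {
                currentKeys.addAll(keysAtMax);
            }
        }
        currentTimestamp = maxTimestamp;
        return emitted;
    }

    // Runs two listings; a newer object with a lexicographically earlier key arrives in between.
    static List<String> secondRun(boolean updateDuringLoop) {
        ListingDedupSketch state = new ListingDedupSketch();
        SortedMap<String, Long> bucket = new TreeMap<>();
        bucket.put("b", 100L);
        bucket.put("c", 100L);
        state.list(bucket, updateDuringLoop);        // first run emits b and c
        bucket.put("a", 200L);                       // newer object, earlier name
        return state.list(bucket, updateDuringLoop); // second run
    }

    public static void main(String[] args) {
        System.out.println(secondRun(true));   // [a, b, c]: b and c are duplicates
        System.out.println(secondRun(false));  // [a]
    }
}
```

    In this model, clearing the key set as soon as `a` is seen makes `b` and 
`c` look unlisted when the loop reaches them, while deferring the update until 
after the loop keeps the comparison against the state from the previous run.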
    
    
    ---
    
    Thank you for submitting a contribution to Apache NiFi.
    
    In order to streamline the review of the contribution we ask you
    to ensure the following steps have been taken:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
    
    - [ ] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [ ] Have you ensured that the full suite of tests is executed via mvn 
-Pcontrib-check clean install at the root nifi folder?
    - [ ] Have you written or updated unit tests to verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file under nifi-assembly?
    - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?
    - [ ] If adding new Properties, have you added .displayName in addition to 
.name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in 
which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ijokarumawak/nifi nifi-4715

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/3116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3116
    
----
commit 3853b13806121c1479edac2038634992ffc6fdfe
Author: Adam Lamar <adamonduty@...>
Date:   2017-12-24T03:29:02Z

    NIFI-4715: ListS3 produces duplicates in frequently updated buckets
    
    Keep totalListCount, reduce unnecessary persistState
    
    This closes #2361.
    
    Signed-off-by: Koji Kawamura <[email protected]>

commit 4d445055cf605811f85bfed12b33155adbd570a2
Author: Koji Kawamura <ijokarumawak@...>
Date:   2018-10-31T07:01:36Z

    NIFI-4715: Update currentKeys after listing loop
    
    ListS3 used to update currentKeys within the listing loop, which caused
    duplicates. Because S3 returns the object list in lexicographic order, if
    we clear currentKeys during the loop, we cannot tell whether an object has
    already been listed in the case where a newer object has a
    lexicographically earlier name.

----

