GitHub user ijokarumawak opened a pull request:
https://github.com/apache/nifi/pull/3116
NIFI-4715: ListS3 produces duplicates in frequently updated buckets
This PR is based on #2361. To preserve @adamlamar's credit, please do not
squash the first commit when merging. Thanks!
The 2nd commit avoids updating `currentKeys` during the listing loop.
Before this fix, duplicate listings were easy to reproduce even with a small number of
objects. E.g., when 10 objects are uploaded to S3 at the same time, ListS3 can produce 27
FlowFiles. Using min age doesn't address the issue.
Please use [the template file attached to the
JIRA](https://issues.apache.org/jira/secure/attachment/12946341/ListS3_Duplication.xml)
to reproduce.
After applying this fix, I confirmed that ListS3 produces FlowFiles without
duplication. I tested with 10,000 objects, which were all listed without duplication
while being uploaded by PutS3 and listed by ListS3 simultaneously.
---
Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number
you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Is your initial contribution a single, squashed commit?
### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn
-Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main
NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to
.name (programmatic access) for each of the new properties?
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in
which it is rendered?
### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ijokarumawak/nifi nifi-4715
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nifi/pull/3116.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3116
----
commit 3853b13806121c1479edac2038634992ffc6fdfe
Author: Adam Lamar <adamonduty@...>
Date: 2017-12-24T03:29:02Z
NIFI-4715: ListS3 produces duplicates in frequently updated buckets
Keep totalListCount, reduce unnecessary persistState
This closes #2361.
Signed-off-by: Koji Kawamura <[email protected]>
commit 4d445055cf605811f85bfed12b33155adbd570a2
Author: Koji Kawamura <ijokarumawak@...>
Date: 2018-10-31T07:01:36Z
NIFI-4715: Update currentKeys after listing loop
ListS3 used to update currentKeys within the listing loop, which causes
duplicates. Because S3 returns the object list in lexicographic order, if we
clear currentKeys during the loop, we cannot tell whether an object has
already been listed when a newer object has a lexicographically earlier
name.
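The failure mode described in this commit message can be illustrated with a simplified model. The sketch below is not the actual ListS3 code: the class `ListingSketch`, the `Obj` type, and both method names are hypothetical, and only the `currentKeys` / timestamp bookkeeping is modeled. It assumes the processor remembers the latest listed timestamp plus the set of keys listed at that timestamp, and skips any object that is older or already in that set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the listing bookkeeping; all names are hypothetical.
class ListingSketch {
    /** Stand-in for an S3 object summary: key plus last-modified time. */
    static final class Obj {
        final String key;
        final long lastModified;
        Obj(String key, long lastModified) { this.key = key; this.lastModified = lastModified; }
    }

    /** Old behavior: currentKeys is cleared while the loop still filters against it. */
    static List<String> listMutatingInLoop(List<Obj> listing, long currentTimestamp, Set<String> currentKeys) {
        List<String> emitted = new ArrayList<>();
        long maxTimestamp = currentTimestamp;
        for (Obj o : listing) {
            if (o.lastModified < currentTimestamp
                    || (o.lastModified == currentTimestamp && currentKeys.contains(o.key))) {
                continue; // considered already listed
            }
            emitted.add(o.key);
            if (o.lastModified > maxTimestamp) {
                maxTimestamp = o.lastModified;
                currentKeys.clear(); // forgets keys the filter above still needs
            }
            if (o.lastModified == maxTimestamp) {
                currentKeys.add(o.key);
            }
        }
        return emitted;
    }

    /** Fixed behavior: collect keys separately and update currentKeys after the loop. */
    static List<String> listUpdatingAfterLoop(List<Obj> listing, long currentTimestamp, Set<String> currentKeys) {
        List<String> emitted = new ArrayList<>();
        Set<String> listedKeys = new HashSet<>();
        long maxTimestamp = currentTimestamp;
        for (Obj o : listing) {
            if (o.lastModified < currentTimestamp
                    || (o.lastModified == currentTimestamp && currentKeys.contains(o.key))) {
                continue;
            }
            emitted.add(o.key);
            if (o.lastModified > maxTimestamp) {
                maxTimestamp = o.lastModified;
                listedKeys.clear();
            }
            if (o.lastModified == maxTimestamp) {
                listedKeys.add(o.key);
            }
        }
        // Only now is it safe to replace the remembered keys.
        if (maxTimestamp > currentTimestamp) {
            currentKeys.clear();
        }
        currentKeys.addAll(listedKeys);
        return emitted;
    }

    public static void main(String[] args) {
        // "b" was listed in a previous cycle at timestamp 100.
        // "a" is newer (200) but sorts before "b" in the lexicographic listing.
        List<Obj> listing = Arrays.asList(new Obj("a", 200L), new Obj("b", 100L));
        System.out.println(listMutatingInLoop(listing, 100L, new HashSet<>(Arrays.asList("b"))));    // [a, b]
        System.out.println(listUpdatingAfterLoop(listing, 100L, new HashSet<>(Arrays.asList("b")))); // [a]
    }
}
```

In the first variant, processing `a` clears the remembered keys, so `b` (same timestamp as the previous cycle) is no longer recognized and is emitted again. Deferring the update until after the loop keeps the filter consistent for the whole listing.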
----
---