[
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669697#comment-16669697
]
ASF GitHub Bot commented on NIFI-4715:
--------------------------------------
GitHub user ijokarumawak opened a pull request:
https://github.com/apache/nifi/pull/3116
NIFI-4715: ListS3 produces duplicates in frequently updated buckets
This PR is based on #2361. To preserve @adamlamar's credit, please do not
squash the first commit when merging. Thanks!
The 2nd commit avoids updating `currentKeys` during the listing loop.
Before this fix, duplicate listings are easy to reproduce with a small number
of objects. E.g., with 10 objects uploaded to S3 at the same time, ListS3 can
produce 27 FlowFiles. Using a minimum object age does not address the issue.
Please use [the template file attached to the
JIRA](https://issues.apache.org/jira/secure/attachment/12946341/ListS3_Duplication.xml)
to reproduce.
After applying this fix, I confirmed that ListS3 produces FlowFiles without
duplication: 10,000 objects were each listed exactly once while they were
being uploaded by PutS3 and listed by ListS3 simultaneously.
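The idea behind the fix can be sketched roughly as follows. This is
hypothetical, simplified code, not the actual ListS3 implementation: the
"already listed" check runs against the state persisted from the previous
cycle, while keys seen in the current cycle accumulate in a local set that
replaces `currentKeys` only after the loop completes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified sketch (not the actual ListS3 source) of the fixed approach:
// decide "already listed?" against the snapshot of state from the previous
// cycle, and replace currentKeys only after the listing loop completes.
class FixedLister {
    Set<String> currentKeys = new HashSet<>();   // persisted state
    long currentTimestamp = 0L;                  // persisted state

    List<String> cycle(TreeMap<String, Long> bucket) {   // TreeMap mimics S3's
        List<String> emitted = new ArrayList<>();        // lexicographic order
        long maxTimestamp = currentTimestamp;            // start from persisted time
        Set<String> keysAtMaxTimestamp = new HashSet<>();
        for (Map.Entry<String, Long> e : bucket.entrySet()) {
            String key = e.getKey();
            long lastModified = e.getValue();
            if (lastModified < currentTimestamp
                    || (lastModified == currentTimestamp && currentKeys.contains(key))) {
                continue;                                // listed in a previous cycle
            }
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                keysAtMaxTimestamp.clear();              // local set, safe to clear
            }
            if (lastModified == maxTimestamp) {
                keysAtMaxTimestamp.add(key);
            }
            emitted.add(key);
        }
        // Persist state only after the loop, so a mid-loop clear can never
        // forget keys that still need to be checked in the same cycle.
        currentKeys = keysAtMaxTimestamp;
        currentTimestamp = maxTimestamp;
        return emitted;
    }
}
```

With this ordering, an object uploaded between cycles with a lexicographically
earlier name no longer causes previously listed keys to be re-emitted.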
---
Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number
you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Is your initial contribution a single, squashed commit?
### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn
-Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main
NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to
.name (programmatic access) for each of the new properties?
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in
which it is rendered?
### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ijokarumawak/nifi nifi-4715
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nifi/pull/3116.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3116
----
commit 3853b13806121c1479edac2038634992ffc6fdfe
Author: Adam Lamar <adamonduty@...>
Date: 2017-12-24T03:29:02Z
NIFI-4715: ListS3 produces duplicates in frequently updated buckets
Keep totalListCount, reduce unnecessary persistState
This closes #2361.
Signed-off-by: Koji Kawamura <[email protected]>
commit 4d445055cf605811f85bfed12b33155adbd570a2
Author: Koji Kawamura <ijokarumawak@...>
Date: 2018-10-31T07:01:36Z
NIFI-4715: Update currentKeys after listing loop
ListS3 used to update currentKeys within the listing loop, which causes
duplicates. Because S3 returns the object list in lexicographic order, if we
clear currentKeys during the loop, we cannot tell whether an object has
already been listed in the case where a newer object has a lexicographically
earlier name.
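The ordering problem described in this commit message can be sketched with a
small, hypothetical simulation (simplified; not the actual ListS3 source). A
TreeMap stands in for S3's lexicographic listing order, and maxTimestamp is
already initialized from the persisted currentTimestamp, to show that the
mid-loop clear alone still produces a duplicate when a newer object has a
lexicographically earlier name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical simulation of the pre-fix behavior: currentKeys is cleared
// inside the listing loop as soon as a higher lastModified timestamp is
// seen, and the same set is used for the "already listed?" check.
class BuggyLister {
    Set<String> currentKeys = new HashSet<>();   // persisted state
    long currentTimestamp = 0L;                  // persisted state

    List<String> cycle(TreeMap<String, Long> bucket) {   // lexicographic order
        List<String> emitted = new ArrayList<>();
        long maxTimestamp = currentTimestamp;
        for (Map.Entry<String, Long> e : bucket.entrySet()) {
            String key = e.getKey();
            long lastModified = e.getValue();
            if (lastModified < currentTimestamp
                    || (lastModified == currentTimestamp && currentKeys.contains(key))) {
                continue;                        // believed already listed
            }
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                currentKeys.clear();             // BUG: clearing mid-loop forgets
            }                                    // keys checked later in the loop
            currentKeys.add(key);
            emitted.add(key);
        }
        currentTimestamp = maxTimestamp;
        return emitted;
    }
}
```

In the second cycle the newer "a-new" clears currentKeys before "b-old" is
checked, so "b-old" is emitted a second time.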
----
> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------------------------
>
> Key: NIFI-4715
> URL: https://issues.apache.org/jira/browse/NIFI-4715
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.2.0, 1.3.0, 1.4.0
> Environment: All
> Reporter: Milan Das
> Assignee: Koji Kawamura
> Priority: Major
> Attachments: List-S3-dup-issue.xml, ListS3_Duplication.xml,
> screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet, which is not thread safe. When
> ListS3 operates in multi-threaded mode, it sometimes tries to list the same
> file from the S3 bucket; the HashSet data appears to get corrupted.
> {code}
> currentKeys = new HashSet<>();
> // needs a thread-safe implementation, e.g.:
> // currentKeys = ConcurrentHashMap.newKeySet();
> {code}
> *{color:red}+Update+{color}*:
> This is not a HashSet issue. The root cause is:
> When a file is uploaded to S3 while ListS3 is in progress, onTrigger
> initializes maxTimestamp to 0L, which clears the keys via the code below.
> When the lastModifiedTime on an S3 object is the same as currentTimestamp,
> the listed key should be skipped; but because the key has been cleared, the
> same file is loaded again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp,
> not 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}
> Update: 01/03/2018
> There is one more flavor of the same defect.
> Suppose file1 is modified at 1514987611000 on S3 and currentTimestamp =
> 1514987311000 in state.
> 1. file1 will be picked up, and state will be saved with
> currentTimestamp=1514987311000 (although the OS system time is already
> 1514987611000).
> 2. Next cycle, for file2 with lastModified 1514987611000: the keys will be
> cleared because lastModified > maxTimestamp
> (= currentTimestamp = 1514987311000).
> currentTimestamp will be saved as 1514987611000.
> 3. Next cycle, with currentTimestamp=1514987611000: file1, modified at
> 1514987611000, will be picked up again because it is no longer in the keys.
> I think the solution is that currentTimestamp needs to be persisted as the
> current system timestamp.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)