[ https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302646#comment-16302646 ]
Adam Lamar commented on NIFI-4715:
----------------------------------

[~dmilan77] I was able to reproduce the issue you described. ListS3 unnecessarily produces duplicates when two or more files are uploaded during the same millisecond. To reproduce, use a ListS3 processor listing regularly (1 sec) and a PutS3Object processor with multiple threads (I used 10). After a few hundred PUTs, ListS3 will list more objects than PutS3Object has uploaded. I wasn't able to reproduce the same behavior when PutS3Object is configured with a single thread.

Initializing `maxTimestamp` to `currentTimestamp` fixes the issue because it allows the condition at line 260 to add the current key to `currentKeys`. Note that `currentKeys` is only used to track which files have been listed during the current millisecond (not over the whole bucket lifetime), so it's normal for the set to be cleared when `lastModified` changes.

I plan to submit a PR that should fix the issue shortly. Thanks for reporting this issue!

> ListS3 list duplicate files when incoming file throughput to S3 is high
> ------------------------------------------------------------------------
>
>                 Key: NIFI-4715
>                 URL: https://issues.apache.org/jira/browse/NIFI-4715
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>         Environment: All
>            Reporter: Milan Das
>        Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
> ListS3 state is implemented using HashSet. HashSet is not thread safe. When ListS3 operates in multi-threaded mode, it sometimes tries to list the same file from the S3 bucket. It seemed like the HashSet data was getting corrupted:
> {code}
> currentKeys = new HashSet<>(); // would need a thread-safe implementation, e.g.
> currentKeys = ConcurrentHashMap.newKeySet();
> {code}
> *{color:red}+Update+{color}*:
> This is not a HashSet issue. The root cause is: when a file is uploaded to S3 at the same time a ListS3 listing is in progress, onTrigger initializes maxTimestamp to 0L.
> This clears the keys, as shown in the code below. When the lastModifiedTime of an S3 object equals currentTimestamp for an already-listed key, the key should be skipped. Because the key set has been cleared, the same file is listed again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp instead of 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
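For readers following the discussion, the timestamp/key-set deduplication and the proposed fix can be sketched as a small standalone Java model. This is a hypothetical simplification for illustration only, not the actual `org.apache.nifi.processors.aws.s3.ListS3` source: the class name, the `listOnce` method, and the `(key, lastModified)` pair representation are all invented here; only the skip condition, the `maxTimestamp`/`currentKeys` bookkeeping, and the `maxTimestamp = currentTimestamp` initialization mirror the snippets quoted above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone sketch of ListS3's per-cycle dedup state.
public class ListS3Dedup {

    // Persisted between listing cycles, as ListS3 keeps them in processor state.
    private long currentTimestamp = 0L;
    private final Set<String> currentKeys = new HashSet<>();

    // One listing cycle over (key, lastModified) pairs; returns newly emitted keys.
    public List<String> listOnce(List<Object[]> objects) {
        List<String> emitted = new ArrayList<>();
        // The proposed fix: seed maxTimestamp with currentTimestamp instead of
        // 0L, so an object in the current millisecond no longer wipes
        // currentKeys and gets re-listed on the next cycle.
        long maxTimestamp = currentTimestamp;
        for (Object[] obj : objects) {
            String key = (String) obj[0];
            long lastModified = (Long) obj[1];
            // Skip objects already covered by the stored state.
            if (lastModified < currentTimestamp
                    || (lastModified == currentTimestamp && currentKeys.contains(key))) {
                continue;
            }
            emitted.add(key);
            // Track the newest millisecond seen and the keys within it.
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                currentKeys.clear(); // keys from an older millisecond are no longer needed
            }
            if (lastModified == maxTimestamp) {
                currentKeys.add(key);
            }
        }
        currentTimestamp = maxTimestamp;
        return emitted;
    }

    public static void main(String[] args) {
        ListS3Dedup lister = new ListS3Dedup();
        // Cycle 1: object "a" uploaded at t=1000.
        List<Object[]> cycle1 = new ArrayList<>();
        cycle1.add(new Object[]{"a", 1000L});
        System.out.println(lister.listOnce(cycle1)); // [a]
        // Cycle 2: "b" landed in the same millisecond as "a".
        List<Object[]> cycle2 = new ArrayList<>();
        cycle2.add(new Object[]{"a", 1000L});
        cycle2.add(new Object[]{"b", 1000L});
        System.out.println(lister.listOnce(cycle2)); // [b] only, "a" is not duplicated
        // Cycle 3: nothing new, nothing emitted.
        System.out.println(lister.listOnce(cycle2)); // []
    }
}
```

With the buggy `long maxTimestamp = 0L;`, cycle 2 would clear `currentKeys` (dropping "a") before storing state, so cycle 3 would emit "a" again: exactly the duplicate described in the comment.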