[
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302646#comment-16302646
]
Adam Lamar commented on NIFI-4715:
----------------------------------
[~dmilan77] I was able to reproduce the issue you described. ListS3
unnecessarily produces duplicates when two or more files are uploaded during
the same millisecond. To reproduce, use a ListS3 processor listing regularly (1
sec) and a PutS3Object processor with multiple threads (I used 10). After a few
hundred PUTs, ListS3 will list more objects than PutS3Object has uploaded. I
wasn't able to reproduce the same behavior when PutS3Object is configured with
a single thread.
Initializing `maxTimestamp` to `currentTimestamp` fixes the issue because it
allows the condition at line 260 to add the current key to `currentKeys`. Note
that `currentKeys` only tracks which files have been listed during the current
millisecond (not over the whole bucket lifetime), so it's normal for the set to
be cleared when `lastModified` changes.
I plan to submit a PR that should fix the issue shortly. Thanks for reporting
this issue!
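To make the timestamp/key-set interaction concrete, here is a minimal, hypothetical sketch of the de-duplication logic under discussion. It is not the actual NiFi ListS3 code; the class and member names (ListingDedupSketch, ObjectStub, list) are illustrative. The point it demonstrates is the proposed fix: starting each pass with maxTimestamp equal to the stored currentTimestamp, so objects sharing that timestamp are checked against currentKeys instead of wiping the set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of ListS3's listing state; not the NiFi API.
public class ListingDedupSketch {

    static class ObjectStub {
        final String key;
        final long lastModified;
        ObjectStub(String key, long lastModified) {
            this.key = key;
            this.lastModified = lastModified;
        }
    }

    // State carried across listings, as in the processor's stored state.
    long currentTimestamp = 0L;                // newest lastModified seen so far
    Set<String> currentKeys = new HashSet<>(); // keys already listed at that timestamp

    // Returns the keys newly listed in this pass.
    Set<String> list(ObjectStub[] bucket) {
        Set<String> listed = new HashSet<>();
        // The proposed fix: start from currentTimestamp rather than 0L.
        long maxTimestamp = currentTimestamp;
        for (ObjectStub o : bucket) {
            if (o.lastModified < currentTimestamp
                    || (o.lastModified == currentTimestamp && currentKeys.contains(o.key))) {
                continue; // already listed in a previous pass
            }
            listed.add(o.key);
            if (o.lastModified > maxTimestamp) {
                maxTimestamp = o.lastModified;
                currentKeys.clear(); // safe: only keys at the newest timestamp matter
            }
            if (o.lastModified == maxTimestamp) {
                currentKeys.add(o.key);
            }
        }
        currentTimestamp = maxTimestamp;
        return listed;
    }
}
```

With maxTimestamp initialized to 0L instead, the first object would always satisfy `lastModified > maxTimestamp` and clear currentKeys, so objects uploaded in the same millisecond as the stored timestamp would be listed again on the next pass.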
> ListS3 list duplicate files when incoming file throughput to S3 is high
> ------------------------------------------------------------------------
>
> Key: NIFI-4715
> URL: https://issues.apache.org/jira/browse/NIFI-4715
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.2.0, 1.3.0, 1.4.0
> Environment: All
> Reporter: Milan Das
> Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet, which is not thread safe. When
> ListS3 operates in multi-threaded mode, it sometimes lists the same file
> from the S3 bucket twice. It seems the HashSet data is getting corrupted.
> {code}
> currentKeys = new HashSet<>(); // needs a thread-safe implementation, e.g.
> currentKeys = ConcurrentHashMap.newKeySet();
> {code}
> *{color:red}+Update+{color}*:
> This is not a HashSet issue. The root cause is as follows:
> When a file is uploaded to S3 at the same time that ListS3 is running,
> onTrigger initializes maxTimestamp to 0L, which clears the keys (see the
> code below). When the lastModified time on an S3 object is the same as
> currentTimestamp, the listed key should be skipped; but because the key has
> been cleared, the same file is loaded again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp
> rather than 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
> maxTimestamp = lastModified;
> currentKeys.clear();
> getLogger().debug("clearing keys");
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)