[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302646#comment-16302646
 ] 

Adam Lamar commented on NIFI-4715:
----------------------------------

[~dmilan77] I was able to reproduce the issue you described. ListS3 
unnecessarily produces duplicates when two or more files are uploaded during 
the same millisecond. To reproduce, use a ListS3 processor listing regularly (1 
sec) and a PutS3Object processor with multiple threads (I used 10). After a few 
hundred PUTs, ListS3 will list more objects than PutS3Object has uploaded. I 
wasn't able to reproduce the same behavior when PutS3Object is configured with 
a single thread.
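
For anyone who wants to drive the same-millisecond collisions outside of NiFi, here is a rough sketch using the AWS SDK for Java v1 directly (my own illustration, not part of the processor; the bucket name and key prefix are placeholders). Ten threads uploading small objects will routinely produce several keys with identical lastModified milliseconds:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/**
 * Uploads many small objects from several threads so that some of them share
 * the same lastModified millisecond, mimicking a multi-threaded PutS3Object.
 */
public class ConcurrentPutRepro {
    public static void main(String[] args) throws InterruptedException {
        final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        final String bucket = "my-test-bucket"; // placeholder
        final ExecutorService pool = Executors.newFixedThreadPool(10);

        for (int i = 0; i < 500; i++) {
            final String key = "repro/object-" + i;
            pool.submit(() -> s3.putObject(bucket, key, "payload"));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        // Run ListS3 against the same bucket on a short schedule while this is
        // uploading; when the bug triggers, the number of listed FlowFiles
        // exceeds the number of objects uploaded.
    }
}
{code}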

Initializing `maxTimestamp` to `currentTimestamp` fixes the issue because it 
allows the condition at line 260 to add the current key to `currentKeys`. Note 
that `currentKeys` is only used to track which files have been listed during 
the current millisecond (not over the whole bucket lifetime), so it's normal 
for the set to be cleared when `lastModified` changes.
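
To make the timestamp/key bookkeeping concrete, below is a small self-contained simulation of the logic described above (a toy model written for this comment, not the actual ListS3 source; the class and method names are made up). Two objects share one lastModified millisecond, but the second only becomes visible on the second listing cycle. With `maxTimestamp` starting at 0L, the first key emitted in a cycle clears the keys remembered from the previous cycle, so both objects are re-listed on cycle 3; starting `maxTimestamp` at `currentTimestamp` avoids the clear and nothing is duplicated:

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the ListS3 listing state (simplified; not the real processor code). */
public class ListS3DedupSketch {

    // State the processor keeps between onTrigger invocations.
    private long currentTimestamp = 0L;
    private final Set<String> currentKeys = new HashSet<>();
    private final boolean initToCurrentTimestamp; // true = proposed fix

    ListS3DedupSketch(boolean initToCurrentTimestamp) {
        this.initToCurrentTimestamp = initToCurrentTimestamp;
    }

    /** One listing cycle; returns the keys that would be emitted as FlowFiles. */
    List<String> listCycle(List<String> bucketKeys, long lastModified) {
        long maxTimestamp = initToCurrentTimestamp ? currentTimestamp : 0L;
        List<String> emitted = new ArrayList<>();

        for (String key : bucketKeys) {
            // Skip objects already listed in a previous cycle.
            if (lastModified < currentTimestamp
                    || (lastModified == currentTimestamp && currentKeys.contains(key))) {
                continue;
            }
            emitted.add(key);

            // Track keys seen at the newest timestamp. Starting maxTimestamp at 0L
            // makes the first emitted key clear the keys remembered from the
            // previous cycle, so those objects are re-listed later.
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                currentKeys.clear();
            }
            if (lastModified == maxTimestamp) {
                currentKeys.add(key);
            }
        }
        currentTimestamp = Math.max(currentTimestamp, maxTimestamp);
        return emitted;
    }

    public static void main(String[] args) {
        long t = 1513900000000L; // both objects share this millisecond

        for (boolean fixed : new boolean[] {false, true}) {
            ListS3DedupSketch state = new ListS3DedupSketch(fixed);
            System.out.println(fixed ? "fixed:" : "buggy:");
            System.out.println("  cycle 1 " + state.listCycle(List.of("a"), t));
            System.out.println("  cycle 2 " + state.listCycle(List.of("a", "b"), t));
            System.out.println("  cycle 3 " + state.listCycle(List.of("a", "b"), t));
            // buggy: cycle 3 re-lists [a, b]; fixed: cycle 3 lists nothing
        }
    }
}
{code}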

I plan to submit a PR shortly that should fix the issue. Thanks for reporting 
it!

> ListS3 lists duplicate files when incoming file throughput to S3 is high
> ------------------------------------------------------------------------
>
>                 Key: NIFI-4715
>                 URL: https://issues.apache.org/jira/browse/NIFI-4715
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>         Environment: All
>            Reporter: Milan Das
>         Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet. HashSet is not thread safe. When 
> ListS3 operates in multi-threaded mode, it sometimes tries to list the same 
> file from the S3 bucket. It seems like the HashSet data is getting corrupted:
> currentKeys = new HashSet<>(); // needs to be thread safe, e.g. currentKeys = ConcurrentHashMap.newKeySet();
> *{color:red}+Update+{color}*:
> This is not a HashSet issue.
> Root cause:
> The problem occurs when a file is uploaded to S3 while a ListS3 listing is in 
> progress. In onTrigger, maxTimestamp is initialized to 0L, which causes the 
> keys to be cleared by the code block below. A listed key whose lastModified 
> time equals currentTimestamp should be skipped, but because the key has 
> already been cleared from currentKeys, the same file is listed again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp 
> instead of 0L:
> {code}
>  long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
