[jira] [Commented] (NIFI-4715) ListS3 list duplicate files when incoming file throughput to S3 is high

Joseph Witt (JIRA) Tue, 19 Dec 2017 19:47:16 -0800

    [ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297848#comment-16297848
 ]


Joseph Witt commented on NIFI-4715:
-----------------------------------

[~dmilan77]  ListS3 is annotated with @TriggerSerially, which means it can only 
ever run with one thread.  It is designed to run this way exclusively.  When you 
say it runs in multi-threaded mode, do you mean you are able to run it with 
more than one thread?  If so, can you share a screenshot?

The listing is designed to be single-threaded.  The listing results can then 
be distributed around the cluster via the site-to-site (S2S) protocol and 
fetched in parallel.  This List/Fetch pattern is now extremely common for 
massive-scale flows.

Please confirm whether there is a bug or a misunderstanding of how it works.

> ListS3 list duplicate files when incoming file throughput to S3 is high
> -----------------------------------------------------------------------
>
>                 Key: NIFI-4715
>                 URL: https://issues.apache.org/jira/browse/NIFI-4715
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>         Environment: All
>            Reporter: Milan Das
>
> ListS3 state is implemented using a HashSet. HashSet is not thread safe. When 
> ListS3 operates in multi-threaded mode, it sometimes lists the same file 
> from the S3 bucket more than once. It seems the HashSet data is getting 
> corrupted.
> currentKeys = new HashSet<>(); // needs a thread-safe implementation, such as:
> currentKeys = ConcurrentHashMap.newKeySet();
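
For reference, the replacement proposed in the description can be sketched in 
plain Java as below.  This is only an illustration of 
ConcurrentHashMap.newKeySet() under concurrent writers, not the ListS3 code: the 
class name, the addConcurrently helper, and the key format are hypothetical. 
Whether ListS3 needs such a change depends on whether it can actually run with 
more than one thread, which is the open question in this thread.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentKeySetSketch {

    // Each worker adds keysPerThread distinct keys to a shared set;
    // returns the final size of the set.
    static int addConcurrently(int threads, int keysPerThread) {
        // Thread-safe Set view backed by a ConcurrentHashMap.
        Set<String> currentKeys = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int worker = t;
            pool.execute(() -> {
                for (int i = 0; i < keysPerThread; i++) {
                    currentKeys.add("worker-" + worker + "/key-" + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return currentKeys.size();
    }

    public static void main(String[] args) {
        // 4 workers x 1000 distinct keys each
        System.out.println(addConcurrently(4, 1000)); // prints 4000
    }
}
```

With a plain HashSet in place of the concurrent set, the same test has no 
guaranteed result, since HashSet makes no promises under unsynchronized 
concurrent modification.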



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
