[
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307073#comment-16307073
]
ASF GitHub Bot commented on NIFI-4715:
--------------------------------------
Github user adamlamar commented on the issue:
https://github.com/apache/nifi/pull/2361
@ijokarumawak From the [AWS S3 API documentation](https://docs.aws.amazon.com/AmazonS3/latest/API/v2-RESTBucketGET.html#v2-RESTBucketGET-requests) (see the `continuation-token` section):
> Amazon S3 lists objects in UTF-8 character encoding in lexicographical
order
I really wish we could take the approach you suggested (would certainly
make things easier), but since the entries are in lexicographical/alphabetical
order, we must iterate over the entire listing before updating
`currentTimestamp`. Otherwise we risk skipping keys newer than
`currentTimestamp` but older than keys in the middle of the list. The
lexicographical ordering also matches my experience when using the API.
Unfortunately this does also mean that duplicates are possible when a
listing fails, like the `IOException` scenario you mentioned. This is an
existing limitation in ListS3.
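To make the constraint concrete, here is a minimal sketch (my own hypothetical naming, not the actual ListS3 code) of the pattern described above: because the listing is in lexicographical key order rather than timestamp order, a newer object can appear anywhere in the listing, so the working `maxTimestamp` is only committed to `currentTimestamp` after the entire listing has been consumed:

```java
import java.util.*;

// Hypothetical sketch, not the real ListS3 implementation.
// S3 returns keys in lexicographical order, NOT timestamp order, so a
// newer object can appear anywhere in the listing. currentTimestamp may
// only be advanced once the full listing has been consumed successfully.
class ListingSketch {
    long currentTimestamp = 0L;
    Set<String> currentKeys = new HashSet<>();

    // entries: key -> lastModified, iterated in lexicographical key order
    List<String> listNewObjects(SortedMap<String, Long> entries) {
        List<String> emitted = new ArrayList<>();
        long maxTimestamp = currentTimestamp; // seed with currentTimestamp, not 0L
        Set<String> keysAtMax = new HashSet<>();
        for (Map.Entry<String, Long> e : entries.entrySet()) {
            long lastModified = e.getValue();
            if (lastModified < currentTimestamp) {
                continue; // strictly older: already listed in a prior run
            }
            if (lastModified == currentTimestamp
                    && currentKeys.contains(e.getKey())) {
                continue; // same timestamp and already listed: duplicate
            }
            emitted.add(e.getKey());
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                keysAtMax.clear(); // safe here: only a local working set
            }
            if (lastModified == maxTimestamp) {
                keysAtMax.add(e.getKey());
            }
        }
        // Commit state only after the whole listing succeeded; if the
        // listing fails mid-way (e.g. IOException), state is unchanged
        // and the next run may re-emit some keys (duplicates possible).
        currentTimestamp = maxTimestamp;
        currentKeys = keysAtMax;
        return emitted;
    }
}
```

The key point is that clearing a *local* working set during iteration is harmless; clearing the committed `currentKeys` mid-listing is what causes the re-listing described in the issue.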
I appreciate your help getting this reviewed! :)
> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------------------------
>
> Key: NIFI-4715
> URL: https://issues.apache.org/jira/browse/NIFI-4715
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.2.0, 1.3.0, 1.4.0
> Environment: All
> Reporter: Milan Das
> Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet, which is not thread safe. When
> ListS3 operates in multi-threaded mode, it sometimes tries to list the same
> file from the S3 bucket. It seems like the HashSet data is getting corrupted.
> currentKeys = new HashSet<>(); // needs to be thread safe, e.g.
> currentKeys = ConcurrentHashMap.newKeySet();
> *{color:red}+Update+{color}*:
> This is not a HashSet issue.
> The root cause is that the problem occurs when a file is uploaded to S3
> while ListS3 is in progress.
> In onTrigger, maxTimestamp is initialized to 0L, which causes the keys to
> be cleared by the code below.
> When the lastModifiedTime on an S3 object is the same as currentTimestamp
> for a listed key, that key should be skipped. Because the key set has been
> cleared, the same file is listed again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp,
> not 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)