[
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Milan Das updated NIFI-4715:
----------------------------
Description:
ListS3 state is implemented using a HashSet, which is not thread safe. When
ListS3 operates in multi-threaded mode, it sometimes lists the same file from
the S3 bucket more than once; the HashSet data appears to get corrupted.
{code}
currentKeys = new HashSet<>(); // needs a thread-safe implementation, e.g.:
currentKeys = ConcurrentHashMap.newKeySet();
{code}
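For reference, the originally suggested change can be sketched as a standalone demo (hypothetical class and method names, not the actual processor code): {{ConcurrentHashMap.newKeySet()}} returns a Set that tolerates concurrent writers, where a plain HashSet does not.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentKeySetDemo {
    // Fill a thread-safe key set from several threads at once; a plain
    // HashSet used this way can corrupt its internal state.
    static Set<String> fill(int threads, int keysPerThread) throws InterruptedException {
        Set<String> currentKeys = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < keysPerThread; i++) {
                    currentKeys.add("bucket/key-" + id + "-" + i);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return currentKeys;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fill(4, 1000).size()); // all 4000 adds land
    }
}
```

Note the update below: the corruption turned out not to be the root cause, but the concurrent key set is still the safer structure for multi-threaded use.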
*{color:red}+Update+{color}*:
This is not a HashSet issue. The root cause is a race that occurs when a file
is uploaded to S3 while a ListS3 run is in progress. In onTrigger, maxTimestamp
is initialized to 0L, which causes currentKeys to be cleared by the code below.
When an object's lastModifiedTime equals the persisted currentTimestamp, the
listed key should be skipped; but because the key set has been cleared, the
same file is listed again.
I think the fix is to initialize maxTimestamp with currentTimestamp instead of 0L:
{code}
long maxTimestamp = currentTimestamp;
{code}
The following block clears the keys:
{code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
if (lastModified > maxTimestamp) {
    maxTimestamp = lastModified;
    currentKeys.clear();
    getLogger().debug("clearing keys");
}
{code}
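The interaction can be reproduced with a small standalone simulation (hypothetical names; this mirrors the clear/skip logic quoted above, not the actual processor source). With maxTimestamp initialized to 0L, a key whose lastModified equals the persisted currentTimestamp is cleared out of the key set and re-emitted; initializing with currentTimestamp skips it as expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ListingSimulation {
    // One listing cycle over (key -> lastModified). Mirrors the clear/skip
    // logic described above and returns the keys that would be emitted.
    static List<String> cycle(Map<String, Long> objects, Set<String> currentKeys,
                              long currentTimestamp, boolean initMaxWithCurrent) {
        List<String> emitted = new ArrayList<>();
        long maxTimestamp = initMaxWithCurrent ? currentTimestamp : 0L;
        for (Map.Entry<String, Long> e : objects.entrySet()) {
            long lastModified = e.getValue();
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                currentKeys.clear(); // the block quoted above
            }
            if (lastModified < currentTimestamp
                    || (lastModified == currentTimestamp && currentKeys.contains(e.getKey()))) {
                continue; // older, or same timestamp and already listed: skip
            }
            currentKeys.add(e.getKey());
            emitted.add(e.getKey());
        }
        return emitted;
    }

    public static void main(String[] args) {
        Set<String> keys = new java.util.HashSet<>(List.of("file1"));
        // maxTimestamp starts at 0L: the clear wipes "file1", so it is re-listed.
        System.out.println(cycle(Map.of("file1", 100L), keys, 100L, false)); // [file1]
        Set<String> keys2 = new java.util.HashSet<>(List.of("file1"));
        // maxTimestamp starts at currentTimestamp: "file1" is skipped.
        System.out.println(cycle(Map.of("file1", 100L), keys2, 100L, true)); // []
    }
}
```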
Update: 01/03/2018
There is one more flavor of the same defect.
Suppose file1 is modified at 1514987611000 on S3 and currentTimestamp =
1514987311000 in the persisted state.
1. file1 will be picked up, and the state will be updated to
currentTimestamp=1514987311000 (even though the OS system time is 1514987611000).
2. On the next cycle, for file2 with lastModified 1514987611000, the keys will
be cleared because lastModified > maxTimestamp (= currentTimestamp =
1514987311000). currentTimestamp will be saved as 1514987611000.
3. On the next cycle, currentTimestamp=1514987611000, and file1 (modified at
1514987611000) will be picked up again because file1 is no longer in the keys.
I think the solution is that currentTimestamp needs to be persisted as the
current system timestamp.
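For readability, the raw epoch-millisecond values in the walkthrough above can be decoded with a short sketch; the two timestamps fall on 01/03/2018 and are exactly five minutes apart.

```java
import java.time.Instant;

public class TimestampDelta {
    public static void main(String[] args) {
        long persisted = 1514987311000L;    // currentTimestamp from state
        long lastModified = 1514987611000L; // file1/file2 lastModified
        System.out.println(Instant.ofEpochMilli(persisted));    // 2018-01-03T13:48:31Z
        System.out.println(Instant.ofEpochMilli(lastModified)); // 2018-01-03T13:53:31Z
        System.out.println((lastModified - persisted) / 1000 + " seconds apart"); // 300 seconds apart
    }
}
```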
was:
ListS3 state is implemented using HashSet. HashSet is not thread safe. When
ListS3 operates in multi threaded mode, sometimes it tries to list same file
from S3 bucket. Seems like HashSet data is getting corrupted.
currentKeys = new HashSet<>(); // need to be implemented Thread Safe like
currentKeys = //ConcurrentHashMap.newKeySet();
*{color:red}+Update+{color}*:
This is not a HashSet issue:
Root cause is:
When the file gets uploaded to S3 simultaneously when List S3 is in progress.
onTrigger--> maxTimestamp is initiated as 0L.
This is clearing keys as per the code below
When lastModifiedTime on S3 object is same as currentTimestamp for the listed
key it should be skipped. As the key is cleared, it is loading the same file
again.
I think fix should be to initiate the maxTimestamp with currentTimestamp not 0L.
{code}
long maxTimestamp = currentTimestamp;
{code}
Following block is clearing keys.
{code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
if (lastModified > maxTimestamp) {
maxTimestamp = lastModified;
currentKeys.clear();
getLogger().debug("clearing keys");
}
{code}
> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------------------------
>
> Key: NIFI-4715
> URL: https://issues.apache.org/jira/browse/NIFI-4715
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.2.0, 1.3.0, 1.4.0
> Environment: All
> Reporter: Milan Das
> Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)