[ 
https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307146#comment-16307146
 ] 

ASF GitHub Bot commented on NIFI-4715:
--------------------------------------

Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @adamlamar I tried to find the API documentation but couldn't find the 
exact statement. Thanks, now it makes sense to defer updating 
`currentTimestamp` until all listed entities are examined.
    
    Actually, that makes me want to ask another related question: do we risk 
producing duplicates by updating `currentKeys` in the middle of the loop?
    
    ## Simulation
    
    For example, suppose ListS3 listed the following entities at the 1st onTrigger.
    
    ### The 1st onTrigger simulation at t1
    
    This should track `currentTimestamp` as `t1`, and `currentKeys` as `b1.txt`.
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |b1.txt|t1|t1|
    
    ### The 2nd onTrigger simulation at t2
    
    If S3 is being updated at the same time, there may be additional entities 
with the same t1 timestamp that were not listed at the 1st onTrigger. In that 
case, those will be listed at the next onTrigger. The following table shows 
the expected effect of tracking `currentKeys`: `b1.txt` will not be listed at 
t2 because it was already listed at t1 and kept in `currentKeys`.
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |b1.txt|t1|t1|
    |c1.txt|t1|t2|
    |d1.txt|t1|t2|
    
    ### The 2nd onTrigger simulation at t2, with a newer-timestamp entry 
preceding in lexicographical order
    
    Based on the above scenario, let's consider an edge case. New entries with 
a later lastModified timestamp can be added around the time the 2nd onTrigger 
runs. This could break the current implementation, which updates `currentKeys` 
in the middle of the loop, because entities are returned in lexicographical 
order. What if there were an `a2.txt` with a later timestamp than t1?
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |a2.txt|t2|t2|
    |b1.txt|t1|t1 and *t2*|
    |c1.txt|t1|t2|
    |d1.txt|t1|t2|
    
    With the current implementation, `b1.txt` would be listed again at t2.
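
    A minimal, self-contained Java model of the issue (a simplified sketch 
matching the tables above with t1 = 10, t2 = 20 — not the actual ListS3 code): 
clearing `currentKeys` as soon as a newer timestamp is seen forgets that 
`b1.txt` was already emitted, so the later `b1.txt` entry at t1 passes the 
skip check again.

```java
// Simplified model of the mid-loop clearing problem (illustrative only,
// not the actual ListS3 processor; entity data is hard-coded).
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MidLoopClearDemo {
    // State left by the 1st onTrigger: currentTimestamp = t1, b1.txt already listed.
    long currentTimestamp = 10L;
    Set<String> currentKeys = new HashSet<>(Set.of("b1.txt"));

    List<String> secondOnTrigger() {
        // S3 returns entities in lexicographical order: a2.txt (t2) precedes b1.txt (t1).
        String[][] entities = {
            {"a1.txt", "5"}, {"a2.txt", "20"}, {"b1.txt", "10"},
            {"c1.txt", "10"}, {"d1.txt", "10"}};
        long maxTimestamp = currentTimestamp;
        List<String> listed = new ArrayList<>();
        for (String[] e : entities) {
            long lastModified = Long.parseLong(e[1]);
            if (lastModified < currentTimestamp) continue;  // older than last run, skip
            if (lastModified == currentTimestamp && currentKeys.contains(e[0]))
                continue;                                   // already listed at t1, skip
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                currentKeys.clear(); // mid-loop clear: b1.txt is forgotten here
            }
            currentKeys.add(e[0]);
            listed.add(e[0]);
        }
        currentTimestamp = maxTimestamp;
        return listed;
    }
}
```

    Running `secondOnTrigger()` yields `a2.txt, b1.txt, c1.txt, d1.txt`: 
because `a2.txt` cleared `currentKeys` before `b1.txt` was examined, `b1.txt` 
is emitted a second time.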
    
    ## Suggestion
    
    - Like `maxTimestamp` (representing the latest timestamp at the current 
onTrigger) and `currentTimestamp` (representing the latest timestamp at the 
last onTrigger), use separate variables to track the keys having the latest 
timestamp for the last run and for the current run.
    - Renaming the variables would probably also make the code more readable.
    - Inside the loop, update only `maxTimestamp` and the keys having the 
latest timestamp of the current iteration, leaving the variables that track 
the previous onTrigger state untouched. Then, after the loop, update those 
variables to hold the new `previous` onTrigger state.
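
    A sketch of that suggestion (hypothetical names, not the actual ListS3 
fields): `lastRunTimestamp`/`lastRunKeys` hold the previous onTrigger state 
and are only replaced after the loop finishes, so an entity at the old 
timestamp is still deduplicated even when a newer entry precedes it 
lexicographically.

```java
// Hypothetical sketch of the suggested fix (illustrative names, not the
// actual ListS3 implementation): previous-run state is promoted only after
// the loop, never mutated mid-iteration.
import java.util.HashSet;
import java.util.Set;

class SafeListingState {
    long lastRunTimestamp = 0L;                // latest timestamp seen by the previous onTrigger
    Set<String> lastRunKeys = new HashSet<>(); // keys carrying that timestamp

    static final class Entity {
        final String key;
        final long lastModified;
        Entity(String key, long lastModified) { this.key = key; this.lastModified = lastModified; }
    }

    Set<String> onTrigger(Iterable<Entity> entities) {
        long maxTimestamp = lastRunTimestamp;
        Set<String> maxTimestampKeys = new HashSet<>(lastRunKeys);
        Set<String> listed = new HashSet<>();

        for (Entity e : entities) {
            // Skip entities already covered by the previous run.
            if (e.lastModified < lastRunTimestamp
                    || (e.lastModified == lastRunTimestamp && lastRunKeys.contains(e.key))) {
                continue;
            }
            if (e.lastModified > maxTimestamp) {
                // A newer timestamp starts a fresh key set; lastRunKeys stays
                // intact, so a later b1.txt at the old timestamp is still skipped above.
                maxTimestamp = e.lastModified;
                maxTimestampKeys = new HashSet<>();
            }
            if (e.lastModified == maxTimestamp) {
                maxTimestampKeys.add(e.key);
            }
            listed.add(e.key);
        }

        // Promote current-run state to previous-run state only after the full loop.
        lastRunTimestamp = maxTimestamp;
        lastRunKeys = maxTimestampKeys;
        return listed;
    }
}
```

    In the edge-case simulation above, the 2nd onTrigger then lists only 
`a2.txt`, `c1.txt`, and `d1.txt`; `b1.txt` stays deduplicated because 
`lastRunKeys` is not cleared when `a2.txt` is encountered.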
    
    The above approach would reflect the background better and also provide 
cleaner, more easily understandable code. I may be overly concerned with 
details, but I feel this can be a bit better. Thanks for your patience!


> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------------------------
>
>                 Key: NIFI-4715
>                 URL: https://issues.apache.org/jira/browse/NIFI-4715
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>         Environment: All
>            Reporter: Milan Das
>         Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet. HashSet is not thread safe. When 
> ListS3 operates in multi-threaded mode, it sometimes tries to list the same 
> file from the S3 bucket. It seems like the HashSet data is getting corrupted.
> currentKeys = new HashSet<>(); // needs to be made thread safe, e.g. 
> currentKeys = ConcurrentHashMap.newKeySet();
> *{color:red}+Update+{color}*:
> This is not a HashSet issue. The root cause is:
> Files get uploaded to S3 simultaneously while ListS3 is in progress.
> In onTrigger, maxTimestamp is initialized to 0L.
> This clears the keys, as per the code below.
> When the lastModifiedTime on an S3 object is the same as currentTimestamp for 
> a listed key, it should be skipped. Because the key set is cleared, the same 
> file is loaded again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp, 
> not 0L.
> {code}
>  long maxTimestamp = currentTimestamp;
> {code}
> Following block is clearing keys.
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
>  if (lastModified > maxTimestamp) {
>      maxTimestamp = lastModified;
>      currentKeys.clear();
>      getLogger().debug("clearing keys");
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
