Github user ijokarumawak commented on the issue:
https://github.com/apache/nifi/pull/2361
@adamlamar I tried to find the API documentation but couldn't find the
exact statement. Thanks, now it does makes sense to defer updating
`currentTimestamp` after all listed entities are examined.
Actually that makes me want to ask another related question. Do we risk
making duplication by updating `currentKeys` in the middle of the loop?
## Simulation
For example, ListS3 listed following entities at the 1st onTrigger.
### The 1st onTrigger simulation at t1
This should track `currentTimestamp` as `t1`, and `currentKeys` as `b1.txt`.
|name|last modified|listed at|
|-----|-------------|--------|
|a1.txt|t1 - x|t1|
|b1.txt|t1|t1|
### The 2nd onTrigger simulation at t2
If S3 is being updated at the same time, there may be additional entities
having the same t1 timestamp, but were not listed at the 1st onTrigger. In such
case, those will be listed at the next onTrigger. Following table shows the
expected effect of tracking `currentKeys`. `b1.txt` will not be listed at t2
because it's already listed at t1 and kept in `currentKeys`.
|name|last modified|listed at|
|-----|-------------|--------|
|a1.txt|t1 - x|t1|
|b1.txt|t1|t1|
|c1.txt|t1|t2|
|d1.txt|t1|t2|
### The 2nd onTrigger simulation at t2 with newer timestamp entry preceding
in lexicographical order
Based on the above scenario, let's think about an edge case. New entries
having later lastModified timestamp can be added at the same time at the time
of the 2nd onTrigger ran. This might break the current implementation that
updates `currentKeys` in the middle of the loop because entities are returned
in lexicographical order. What if there was `a2.txt` having later timestamp
than t1?
|name|last modified|listed at|
|-----|-------------|--------|
|a1.txt|t1 - x|t1|
|a2.txt|t2|t2|
|b1.txt|t1|t1 and *t2*|
|c1.txt|t1|t2|
|d1.txt|t1|t2|
With current implementation, `b1.txt` would be listed again at t2.
## Suggestion
- Like `maxTimestamp` (representing the latest timestamp at current
onTrigger) and `currentTimestamp` (representing the latest timestamp at the
last onTrigger), use separate variables to track the keys having the latest
timestamp at the last run and current run.
- Probably renaming variables would make code more readable.
- Update only `maxTimestamp` and the keys with the latest timestamp of
current iteration inside the loop, leave the variables which tracks the
previous onTrigger state as it is. Then, after the loop, update the variables
to track `previous` onTrigger state.
Above approach would reflect background better, and also provide cleaner
easily understandable code. I may be overly concerning details, but am feeling
this can be better a bit more. Thanks for your patience!
---