Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @ijokarumawak I did as you suggested and pulled `persistState` out in the 
case when no new keys have been listed, but this actually caused unit tests to 
fail. This is because `currentTimestamp` never changes during the main loop, so 
even though `commit` calls `persistState`, the value of `currentTimestamp` 
doesn't change until the main loop exits. Which is why `persistState` is 
required in both exit paths.
    
    Instead, I took a slightly different approach with the change just pushed. 
Since `currentTimestamp` is the current value persisted to the state manager, 
`maxTimestamp` is the highest timestamp seen in the main loop, and 
`currentKeys` is tied to `maxTimestamp` (not `currentTimestamp`), I removed the 
`persistState` call in `commit`, and did `persistState` at the end of 
`onTrigger` only. While this does continue to `persistState` on each exit, it 
reduces the number of `persistState` calls to once per `onTrigger` rather than 
once per 1000 keys iterated (which was done previously in `commit`).
    
    I did a bunch of manual testing with concurrent `PutS3Object` and `ListS3` 
and always got the correct number of listed keys, even when uploading 20k+ 
objects using 10 threads. I tried a few strategies to skip `persistState` if 
nothing had changed, but in manual testing it always produced the wrong number 
of keys, although sometimes only off by 1. The current code should be quite an 
improvement to the load on the state manager, even if it isn't ideal.
    
    I also introduced `totalListCount` which helps tighten up the log messages 
a bit. Previously it would "successfully list X objects" followed by "no new 
objects to list" in a single `onTrigger` (this was apparent in the unit test 
output). `totalListCount` also avoids an unnecessary `yield`.
    
    There's a lot going on in this one - let me know if you have any other 
questions!


---

Reply via email to