[ 
https://issues.apache.org/jira/browse/NIFI-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270156#comment-15270156
 ] 

ASF GitHub Bot commented on NIFI-840:
-------------------------------------

Github user adamonduty commented on the pull request:

    https://github.com/apache/nifi/pull/238#issuecomment-216751696
  
    @apiri Thanks for the review! All your comments should be addressed except 
the state management pieces.
    
    State management should work more or less like `ListHDFS`. I personally 
found the state management bits of `ListHDFS` difficult to follow, so I tried a 
similar but different approach.
    
    On each restore, we grab two keys: `CURRENT_TIMESTAMP` and 
`CURRENT_KEY_PREFIX+"0"`. If both of these exist, the state is considered 
valid, and further restoration is attempted. Mostly that means we parse 
`CURRENT_TIMESTAMP` into a long and use `extractKeys()` to gather as many keys 
that start with `CURRENT_KEY_PREFIX` as are present in the state map. These 
values are later used to decide whether a listed key is newer than the last 
time we ran. 
    
    At the end of each run, we persist the newest key timestamp and at least 
one filename with the maximum discovered timestamp.
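    
    As a rough illustration of the restore/compare/persist cycle described 
above, here is a minimal sketch that uses a plain `Map<String, String>` in 
place of NiFi's state manager. The class name, the string values of the two 
constants, and the `isNew()` and `persist()` helpers are assumptions made for 
the example, not the code in the PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ListingStateSketch {
    // Constant names follow the comment; their string values are assumptions.
    static final String CURRENT_TIMESTAMP = "currentTimestamp";
    static final String CURRENT_KEY_PREFIX = "key-";

    long currentTimestamp = 0L;
    final Set<String> currentKeys = new HashSet<>();

    /** Collect every value whose state key starts with CURRENT_KEY_PREFIX. */
    static Set<String> extractKeys(Map<String, String> state) {
        Set<String> keys = new HashSet<>();
        for (Map.Entry<String, String> e : state.entrySet()) {
            if (e.getKey().startsWith(CURRENT_KEY_PREFIX)) {
                keys.add(e.getValue());
            }
        }
        return keys;
    }

    /** State counts as valid only if both marker keys are present. */
    void restore(Map<String, String> state) {
        currentKeys.clear();
        if (state.containsKey(CURRENT_TIMESTAMP)
                && state.containsKey(CURRENT_KEY_PREFIX + "0")) {
            currentTimestamp = Long.parseLong(state.get(CURRENT_TIMESTAMP));
            currentKeys.addAll(extractKeys(state));
        } else {
            currentTimestamp = 0L; // no usable state: treat everything as new
        }
    }

    /** New if later than the stored timestamp, or tied with it under an unseen key. */
    boolean isNew(String key, long lastModified) {
        return lastModified > currentTimestamp
                || (lastModified == currentTimestamp && !currentKeys.contains(key));
    }

    /** Build the map to persist: newest timestamp plus the keys seen at that timestamp. */
    static Map<String, String> persist(long newestTimestamp, Set<String> keysAtNewest) {
        Map<String, String> state = new HashMap<>();
        state.put(CURRENT_TIMESTAMP, String.valueOf(newestTimestamp));
        int i = 0;
        for (String key : keysAtNewest) {
            state.put(CURRENT_KEY_PREFIX + i++, key);
        }
        return state;
    }
}
```

    Storing the keys that share the newest timestamp is what lets the 
comparison tell an already-listed object apart from a different object 
modified in the same millisecond.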
    
    The processor does restore state on each trigger. I think there are a 
couple things to note:
    * Listing an S3 bucket generally takes a little while (e.g. 100ms), so that 
puts a pretty high lower bound on the number of times state might be restored 
per second. In other words, it seems unlikely we'd try to restore state with a 
high frequency.
    * ListS3 also persists state after each trigger. I didn't think about this 
until after submitting this PR, but really we should be committing flowfiles 
and persisting state after every batch (1000 objects by default) to avoid 
unbounded memory growth on large buckets. So in some cases the initial restore 
is only one of potentially many roundtrips to the state manager per trigger. 
    * I have no concept of how expensive restoring and persisting state 
actually are.
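    
    To make the per-batch idea in the second bullet concrete, here is a 
hedged sketch of persisting after every batch rather than once per trigger. 
`Listed`, `processInBatches()` and the `persistState` callback are hypothetical 
stand-ins; in the processor the callback is where the session commit and the 
state manager write would go.

```java
import java.util.Iterator;
import java.util.function.LongConsumer;

class BatchedListingSketch {
    /** Hypothetical stand-in for one listed S3 object. */
    static final class Listed {
        final String key;
        final long lastModified;

        Listed(String key, long lastModified) {
            this.key = key;
            this.lastModified = lastModified;
        }
    }

    /**
     * Consume the listing lazily and call persistState with the newest
     * timestamp seen so far after every full batch and after the final
     * partial batch. Returns the number of persist calls made.
     */
    static int processInBatches(Iterator<Listed> listing, int batchSize,
                                LongConsumer persistState) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        int inBatch = 0;
        long newest = 0L;
        while (listing.hasNext()) {
            Listed object = listing.next();
            newest = Math.max(newest, object.lastModified);
            inBatch++;
            if (inBatch == batchSize || !listing.hasNext()) {
                // Assumed commit point: flowfiles committed, state saved.
                persistState.accept(newest);
                batches++;
                inBatch = 0;
            }
        }
        return batches;
    }
}
```

    With a batch size of 1000, a bucket of 2500 objects would mean three 
persists per trigger on top of the initial restore, which is the roundtrip 
count the caching question hinges on.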
    
    I'd be happy to implement state management caching if you think it's 
necessary. It's probably not complicated, but my general philosophy is that more 
conditionals = more bugs.


> Create ListS3 processor
> -----------------------
>
>                 Key: NIFI-840
>                 URL: https://issues.apache.org/jira/browse/NIFI-840
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Aldrin Piri
>            Assignee: Adam Lamar
>             Fix For: 0.7.0
>
>
> A processor is needed that can provide an S3 listing to use in conjunction 
> with FetchS3Object.  This is to provide a similar user experience as with the 
> HDFS processors that perform List/Get.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
