[ https://issues.apache.org/jira/browse/NIFI-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270156#comment-15270156 ]
ASF GitHub Bot commented on NIFI-840:
-------------------------------------

Github user adamonduty commented on the pull request:

https://github.com/apache/nifi/pull/238#issuecomment-216751696

@apiri Thanks for the review! All your comments should be addressed except the state management pieces.

State management should work more or less like `ListHDFS`. I personally found the state management bits of `ListHDFS` difficult to follow, so I tried a similar but different approach.

On each restore, we grab two keys: `CURRENT_TIMESTAMP` and `CURRENT_KEY_PREFIX+"0"`. If both of these exist, the state is considered valid, and further restoration is attempted. Mostly that means we parse `CURRENT_TIMESTAMP` into a long and use `extractKeys()` to gather as many keys that start with `CURRENT_KEY_PREFIX` as are present in the state map. These values are later used to decide whether a listed key is newer than the last time we ran. At the end of each run, we persist the newest key timestamp and at least one filename with the maximum discovered timestamp.

The processor does restore state on each trigger. I think there are a couple of things to note:

* Listing an S3 bucket generally takes a little while (e.g. 100ms), so that puts a fairly low upper bound on the number of times state might be restored per second. In other words, it seems unlikely we'd try to restore state with a high frequency.
* ListS3 also persists state after each trigger. I didn't think about this until after submitting this PR, but really we should be committing flowfiles and persisting state after every batch (1000 objects by default) to avoid unbounded memory growth on large buckets. So in some cases the initial restore is only one of potentially many roundtrips to the state manager per trigger.
* I have no concept of how expensive restoring and persisting state actually are.

I'd be happy to implement state management caching if you think it's necessary.
It's probably not complicated, but my general philosophy is that more conditionals = more bugs.

> Create ListS3 processor
> -----------------------
>
>          Key: NIFI-840
>          URL: https://issues.apache.org/jira/browse/NIFI-840
>      Project: Apache NiFi
>   Issue Type: Improvement
>   Components: Extensions
>     Reporter: Aldrin Piri
>     Assignee: Adam Lamar
>      Fix For: 0.7.0
>
>
> A processor is needed that can provide an S3 listing to use in conjunction
> with FetchS3Object. This is to provide a similar user experience as with the
> HDFS processors that perform List/Get.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
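
For reference, the restore/persist approach described in the comment above might look roughly like the sketch below. It is not the code from the pull request. A plain `Map<String, String>` stands in for the processor's state map, the class name `ListingStateSketch`, the constant values, and the helpers `isNew()` and `toState()` are hypothetical, and the "newer than last run" comparison is one plausible reading of the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a plain Map<String, String> stands in for the processor's state map.
class ListingStateSketch {
    // Assumed values; the comment names these constants but not their contents.
    static final String CURRENT_TIMESTAMP = "currentTimestamp";
    static final String CURRENT_KEY_PREFIX = "key-";

    long currentTimestamp = 0L;
    Set<String> currentKeys = new HashSet<>();

    // Gather the values of every entry whose name starts with CURRENT_KEY_PREFIX.
    static Set<String> extractKeys(Map<String, String> state) {
        Set<String> keys = new HashSet<>();
        for (Map.Entry<String, String> entry : state.entrySet()) {
            if (entry.getKey().startsWith(CURRENT_KEY_PREFIX)) {
                keys.add(entry.getValue());
            }
        }
        return keys;
    }

    // State is treated as valid only if both marker entries are present;
    // otherwise fall back to an empty starting state.
    void restore(Map<String, String> state) {
        if (state.get(CURRENT_TIMESTAMP) == null
                || state.get(CURRENT_KEY_PREFIX + "0") == null) {
            currentTimestamp = 0L;
            currentKeys = new HashSet<>();
            return;
        }
        currentTimestamp = Long.parseLong(state.get(CURRENT_TIMESTAMP));
        currentKeys = extractKeys(state);
    }

    // Assumed comparison: a later timestamp, or the same timestamp with an unseen key.
    boolean isNew(String key, long lastModified) {
        return lastModified > currentTimestamp
                || (lastModified == currentTimestamp && !currentKeys.contains(key));
    }

    // Build the map to persist: the newest timestamp plus the keys that carry it.
    static Map<String, String> toState(long newestTimestamp, Set<String> keysAtNewest) {
        Map<String, String> state = new HashMap<>();
        state.put(CURRENT_TIMESTAMP, String.valueOf(newestTimestamp));
        int index = 0;
        for (String key : keysAtNewest) {
            state.put(CURRENT_KEY_PREFIX + index, key);
            index++;
        }
        return state;
    }
}
```

Storing the keys that share the newest timestamp, rather than the timestamp alone, is what lets the next run pick up an object with the same timestamp without re-listing the ones already seen.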