[
https://issues.apache.org/jira/browse/NIFI-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270156#comment-15270156
]
ASF GitHub Bot commented on NIFI-840:
-------------------------------------
Github user adamonduty commented on the pull request:
https://github.com/apache/nifi/pull/238#issuecomment-216751696
@apiri Thanks for the review! All your comments should be addressed except
the state management pieces.
State management should work more or less like `ListHDFS`. I personally
found the state management bits of `ListHDFS` difficult to follow, so I tried a
similar but different approach.
On each restore, we grab two keys: `CURRENT_TIMESTAMP` and
`CURRENT_KEY_PREFIX+"0"`. If both of these exist, the state is considered
valid, and further restoration is attempted. Mostly that means we parse
`CURRENT_TIMESTAMP` into a long and use `extractKeys()` to gather as many keys
that start with `CURRENT_KEY_PREFIX` as are present in the state map. These
values are later used to decide whether a listed key is newer than the last time
we ran.
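A minimal sketch of that restore path, for illustration only. It uses a plain `Map<String, String>` as a stand-in for the state map. The literal values of `CURRENT_TIMESTAMP` and `CURRENT_KEY_PREFIX`, the class name, and the exact comparison in `isNew()` are assumptions and are not taken from the PR:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ListingStateSketch {
    // Assumed literal values; the comment above only names the constants.
    static final String CURRENT_TIMESTAMP = "currentTimestamp";
    static final String CURRENT_KEY_PREFIX = "key-";

    long currentTimestamp = 0L;
    Set<String> currentKeys = new HashSet<>();

    // Collect the value of every entry whose name starts with CURRENT_KEY_PREFIX.
    static Set<String> extractKeys(Map<String, String> state) {
        Set<String> keys = new HashSet<>();
        for (Map.Entry<String, String> entry : state.entrySet()) {
            if (entry.getKey().startsWith(CURRENT_KEY_PREFIX)) {
                keys.add(entry.getValue());
            }
        }
        return keys;
    }

    // State is treated as valid only if both the timestamp and the first key exist.
    void restore(Map<String, String> state) {
        String timestamp = state.get(CURRENT_TIMESTAMP);
        String firstKey = state.get(CURRENT_KEY_PREFIX + "0");
        if (timestamp == null || firstKey == null) {
            currentTimestamp = 0L;
            currentKeys = new HashSet<>();
            return;
        }
        currentTimestamp = Long.parseLong(timestamp);
        currentKeys = extractKeys(state);
    }

    // Assumed comparison: newer timestamp, or same timestamp but a key not yet seen.
    boolean isNew(String key, long lastModified) {
        return lastModified > currentTimestamp
                || (lastModified == currentTimestamp && !currentKeys.contains(key));
    }
}
```

With this sketch, a state map that is missing either key simply resets to "nothing seen yet", so every listed object counts as new.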
At the end of each run, we persist the newest key timestamp and at least
one filename with the maximum discovered timestamp.
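The persist step could look roughly like the following. This is a hedged sketch rather than the PR's code: `buildState()` and the class name are hypothetical, the constant values are assumed, and a plain map of key to last-modified time stands in for the S3 listing. It stores every key that shares the maximum timestamp, which satisfies "at least one":

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

public class PersistStateSketch {
    // Assumed literal values for the constants named in the comment.
    static final String CURRENT_TIMESTAMP = "currentTimestamp";
    static final String CURRENT_KEY_PREFIX = "key-";

    // Build the map to persist from the objects listed in this run.
    static Map<String, String> buildState(Map<String, Long> listed) {
        Map<String, String> state = new LinkedHashMap<>();
        if (listed.isEmpty()) {
            return state; // nothing listed, nothing to persist
        }
        long newest = Long.MIN_VALUE;
        TreeSet<String> keysAtNewest = new TreeSet<>(); // sorted for stable numbering
        for (Map.Entry<String, Long> entry : listed.entrySet()) {
            long timestamp = entry.getValue();
            if (timestamp > newest) {
                newest = timestamp;
                keysAtNewest.clear(); // older keys no longer share the maximum
            }
            if (timestamp == newest) {
                keysAtNewest.add(entry.getKey());
            }
        }
        state.put(CURRENT_TIMESTAMP, String.valueOf(newest));
        int index = 0;
        for (String key : keysAtNewest) {
            state.put(CURRENT_KEY_PREFIX + index, key);
            index++;
        }
        return state;
    }
}
```

Numbering the keys from 0 means `CURRENT_KEY_PREFIX+"0"` is always present whenever anything was persisted, which is what the validity check on restore relies on.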
The processor does restore state on each trigger. I think there are a
couple things to note:
* Listing an S3 bucket generally takes a little while (e.g. 100ms), so that
puts a fairly low upper bound on the number of times state might be restored
per second (roughly 10 at 100ms per listing). In other words, it seems unlikely
we'd try to restore state with a high frequency.
* ListS3 also persists state after each trigger. I didn't think about this
until after submitting this PR, but really we should be committing flowfiles
and persisting state after every batch (1000 objects by default) to avoid
unbounded memory growth on large buckets. So in some cases the initial restore
is only one of potentially many roundtrips to the state manager per trigger.
* I have no concept of how expensive restoring and persisting state
actually are.
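To make the round-trip count in the second point concrete, here is a rough sketch of persisting once per batch. Everything in it is hypothetical: `CountingStateStore` only counts calls and stands in for the real state manager, and `listInBatches()` is not a method from the PR:

```java
import java.util.List;

public class BatchedListingSketch {
    // Hypothetical stand-in for the state manager; it only counts round trips.
    static class CountingStateStore {
        int persistCalls = 0;

        void persist() {
            persistCalls++;
        }
    }

    // Walk the listing in fixed-size batches and persist once after each batch.
    // Returns the number of objects handled.
    static int listInBatches(List<String> keys, int batchSize, CountingStateStore store) {
        int handled = 0;
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<String> batch = keys.subList(start, end);
            handled += batch.size();
            // A real processor would commit the batch's flowfiles here,
            // so memory use stays bounded by the batch size.
            store.persist();
        }
        return handled;
    }
}
```

Under these assumptions, a bucket with 2,500 new objects and a batch size of 1000 would mean three persist calls on top of the initial restore.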
I'd be happy to implement state management caching if you think it's
necessary. It's probably not complicated, but my general philosophy is that more
conditionals = more bugs.
> Create ListS3 processor
> -----------------------
>
> Key: NIFI-840
> URL: https://issues.apache.org/jira/browse/NIFI-840
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Aldrin Piri
> Assignee: Adam Lamar
> Fix For: 0.7.0
>
>
> A processor is needed that can provide an S3 listing to use in conjunction
> with FetchS3Object. This is to provide a similar user experience as with the
> HDFS processors that perform List/Get.