1. ListS3 uses the framework's state management (see the persistState() <https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java#L140> and restoreState() <https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java#L129> methods).
2. The ListS3 state tracks the latest modified timestamp and the keys at that timestamp that have already been processed. On subsequent runs, it excludes any object modified before that timestamp, or whose key is in the list of keys already processed at that timestamp. If files are re-written to S3 with new timestamps, I believe ListS3 will see them as new files.

Do you see duplication of both the FlowFile name and id fields, or just the name? Do you see the duplicates under any specific circumstances, such as after a processor or instance start/stop, or during periods of high or low flow volume?

On Sun, Jun 26, 2016 at 7:30 AM, ddewaele <[email protected]> wrote:
> Hi,
>
> I had a question on the ListS3 processor.
> I'm using it to monitor the content of an S3 bucket.
> The idea is that when new files come in, they need to be processed and sent
> through the dataflow, using a FetchS3Object to process the file. This all
> works, but I had 2 questions:
>
> 1. Where does the ListS3 processor keep its state? How does it know what files
> it has already processed, and is there a way to clear this state?
> 2. Sometimes, when syncing files to my S3 buckets, I notice that the ListS3
> processor is picking up the same file twice. Is there a way to avoid that?
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/ListS3-processor-question-duplicate-files-maintaining-state-tp12278.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
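For illustration, the de-duplication logic described in point 2 above can be sketched roughly as follows. This is a simplified, hypothetical version (class and method names are mine, not from the actual ListS3 source), assuming the state holds the newest lastModified timestamp seen plus the set of keys already emitted at exactly that timestamp:

```java
import java.util.Set;

// Hypothetical sketch of the ListS3 de-duplication check described above.
// Not the actual ListS3 implementation; names and structure are illustrative.
public class ListS3StateSketch {
    private final long latestTimestamp;    // newest lastModified tracked in state
    private final Set<String> seenKeys;    // keys already emitted at latestTimestamp

    public ListS3StateSketch(long latestTimestamp, Set<String> seenKeys) {
        this.latestTimestamp = latestTimestamp;
        this.seenKeys = seenKeys;
    }

    // Decide whether an object from the bucket listing should be emitted.
    public boolean shouldList(String key, long lastModified) {
        if (lastModified < latestTimestamp) {
            return false; // older than anything already tracked: skip
        }
        if (lastModified == latestTimestamp && seenKeys.contains(key)) {
            return false; // same timestamp and already emitted: skip
        }
        return true; // newer timestamp, or an unseen key at the same timestamp
    }
}
```

Note the consequence mentioned earlier: if an object is re-uploaded with the same key but a newer lastModified, shouldList returns true, so it would be listed again as a "new" file.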
