Joe S, I agree. I think the design of List/Fetch HDFS is extremely applicable to this. The way it saves state is by using a DistributedMapCacheServer. The intent is to run the List processor on the primary node only; it stores its state there so that if the primary node is changed, any other node can pick up where the last one left off. In order to avoid saving a massive amount of state in memory, it stores the timestamp of the latest file that it has fetched (timestamp = last modified date in this case), as well as the names of all files that share that same timestamp. The next time it runs, it can pull just the things whose lastModifiedDate is later than or equal to that timestamp, and because we saved the filenames at that timestamp as well, it still knows which things to avoid pulling twice.
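Roughly, the filtering logic looks like this (a minimal sketch with illustrative names, not the actual ListHDFS code):

    import java.util.HashSet;
    import java.util.Set;

    // State kept between runs: the newest modification time seen so far,
    // plus every filename that shares that exact timestamp.
    class ListingState {
        long latestTimestamp = 0L;
        Set<String> filesAtLatestTimestamp = new HashSet<>();

        // Decide whether a file should be emitted on this run.
        boolean shouldList(final String path, final long lastModified) {
            if (lastModified < latestTimestamp) {
                return false; // older than the saved state; already listed
            }
            if (lastModified == latestTimestamp) {
                // same timestamp as the saved state; list it only if we
                // haven't already recorded it
                return !filesAtLatestTimestamp.contains(path);
            }
            return true; // strictly newer than anything seen before
        }

        // Call for every file that gets listed, then persist the result.
        void update(final String path, final long lastModified) {
            if (lastModified > latestTimestamp) {
                latestTimestamp = lastModified;
                filesAtLatestTimestamp.clear();
                filesAtLatestTimestamp.add(path);
            } else if (lastModified == latestTimestamp) {
                filesAtLatestTimestamp.add(path);
            }
        }
    }

The set only ever holds the names at the single newest timestamp, so the state stays tiny no matter how many files have been listed over time.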
Now, with ListFile it will be a bit different. We tend to think of GetFile and List/Fetch File as pulling from a local file system; however, they are certainly also used to pull from a network-mounted file system. In that case, all nodes in the cluster need the ability to pull the data in unison, so we will want to save the state in such a way that all nodes in the cluster have access to it, in case the primary node changes. But if the file system is local, we don't want to save state across the cluster, because each node needs its own state. That choice would likely just be an extra property on the processor. If saving state locally, it's easy enough to just write to a text file (I'd recommend you allow the user to specify the state file and default it to conf/ListFile-<processor id>.state, or something like that).
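For the local case, the save/load could be as simple as a properties file. A rough sketch, reusing the ListingState holder from above (the property keys and the write-then-rename step are just my assumptions, not an existing API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.Arrays;
    import java.util.Properties;
    import java.util.Set;

    class LocalListingStateFile {
        private final Path stateFile; // e.g. conf/ListFile-<processor id>.state

        LocalListingStateFile(final Path stateFile) {
            this.stateFile = stateFile;
        }

        void save(final long latestTimestamp, final Set<String> files) throws IOException {
            final Properties props = new Properties();
            props.setProperty("latest.timestamp", Long.toString(latestTimestamp));
            props.setProperty("files.at.timestamp", String.join("\n", files));

            // Write to a temp file and rename into place so a crash
            // mid-write can't leave a half-written state file behind.
            final Path tmp = stateFile.resolveSibling(stateFile.getFileName() + ".tmp");
            try (OutputStream out = Files.newOutputStream(tmp)) {
                props.store(out, "ListFile state");
            }
            Files.move(tmp, stateFile, StandardCopyOption.REPLACE_EXISTING);
        }

        ListingState load() throws IOException {
            final ListingState state = new ListingState();
            if (!Files.exists(stateFile)) {
                return state; // first run: nothing has been listed yet
            }
            final Properties props = new Properties();
            try (InputStream in = Files.newInputStream(stateFile)) {
                props.load(in);
            }
            state.latestTimestamp = Long.parseLong(props.getProperty("latest.timestamp", "0"));
            final String joined = props.getProperty("files.at.timestamp", "");
            if (!joined.isEmpty()) {
                state.filesAtLatestTimestamp.addAll(Arrays.asList(joined.split("\n")));
            }
            return state;
        }
    }

Writing to a temp file first and renaming means a crash mid-write leaves the previous state intact rather than a truncated file.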
I have not documented this pattern, specifically because we've been talking for a while about implementing the Simple State Management but just haven't gotten there yet. I expected that we would have that finished before writing many more of these List/Fetch processors, and it will radically change how we handle all of this. But since it is not there, it may actually make sense to just refactor the ListHDFS processor into an AbstractListFileProcessor that is responsible for handling the state management. I am not sure how complicated that would get, though. Just a thought.

Hopefully this helped to clear things up, rather than muddy them up :) Feel free to fire back any questions.

Thanks
-Mark

----------------------------------------
> Date: Wed, 29 Jul 2015 06:42:39 -0400
> Subject: Re: Fetch change list
> From: [email protected]
> To: [email protected]
>
> JoeS
>
> Sounds great. I'd ignore my provenance comment as that was really
> more about how something external could keep tabs on progress, etc.
> Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
> for the good bits. But the logic to follow for saving state you'll
> want is probably the same.
>
> Mark - do you have the design of that thing documented anywhere? It
> is a good pattern to describe because it is effectively a model for
> taking non-scalable dataflow interfaces and making them behave as if
> they were.
>
> Thanks
> JoeW
>
> On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <[email protected]> wrote:
>> Joe,
>>
>> I'm interested in working on List/FetchFile. It seems like starting with
>> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
>> I'll look at List/FetchHDFS, but is there any further detail on how this
>> functionality should differ from GetFile? As for keeping state,
>> provenance was suggested, a separate state folder might work, or some
>> file systems support additional state that might be usable.
>>
>> Regards,
>> Joe
>>
>> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <[email protected]> wrote:
>>
>>> Anup,
>>>
>>> The two tickets in question appear to be:
>>> https://issues.apache.org/jira/browse/NIFI-631
>>> https://issues.apache.org/jira/browse/NIFI-673
>>>
>>> Neither has been claimed as of yet. Anybody interested in taking one
>>> or both of these on? It would be a lot like List/Fetch HDFS, so you'll
>>> have good examples to work from.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
>>> <[email protected]> wrote:
>>>> Can I expect this functionality in the upcoming releases of NiFi?
>>>>
>>>> On 13/07/15 9:13 am, "Sethuram, Anup" <[email protected]> wrote:
>>>>
>>>>> Where is this 1TB dataset living today?
>>>>> [anup] Resides in a filesystem
>>>>>
>>>>> - What is the current nature of the dataset? Is it already in large
>>>>> bundles as files or is it a series of tiny messages, etc.? Does it
>>>>> need to be split/merged/etc.?
>>>>> [anup] Archived files of size 3MB each, collected over a period.
>>>>> Directory (1TB) -> Sub-Directories -> Files
>>>>>
>>>>> - What is the format of the data? Is it something that can easily be
>>>>> split/merged or will it require special processes to do so?
>>>>> [anup] zip, tar formats.
