Joe S,

I agree. I think the design of List/Fetch HDFS is extremely applicable to this.
The way it saves state is by using a DistributedMapCacheServer. The intent is
to run the List processor on the primary node only, and it will store its
state there so that if the primary node is changed, any other node can pick up
where the last one left off. To avoid saving a massive amount of state in
memory, it stores the timestamp of the latest file that it has fetched, as
well as all files that have that same timestamp (timestamp = last modified
date in this case). So the next time it runs, it can pull just the things
whose last modified date is later than or equal to that timestamp, but it can
still know which things to avoid pulling twice, because we've saved that info
as well.
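To make that bookkeeping concrete, here is a minimal sketch of the pattern
(hypothetical class and method names, not the actual ListHDFS code; in the
real processor the two pieces of state would live in the
DistributedMapCacheServer rather than in plain fields):

```java
import java.util.*;

// Hypothetical sketch of the bookkeeping described above: keep only the
// newest last-modified timestamp plus the names of the files sharing it,
// so state stays tiny but a re-listing never emits the same file twice.
class ListingState {
    private long latestTimestamp = -1L;
    private Set<String> filesAtLatestTimestamp = new HashSet<>();

    /**
     * Given a fresh listing of (path -> last-modified time), return only
     * the entries not yet emitted, then advance the stored state.
     */
    public List<String> filterNew(Map<String, Long> listing) {
        List<String> toEmit = new ArrayList<>();
        long newLatest = latestTimestamp;
        Set<String> newFiles = new HashSet<>(filesAtLatestTimestamp);
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            long ts = e.getValue();
            boolean alreadySeen = ts < latestTimestamp
                || (ts == latestTimestamp && filesAtLatestTimestamp.contains(e.getKey()));
            if (alreadySeen) {
                continue;
            }
            toEmit.add(e.getKey());
            if (ts > newLatest) { newLatest = ts; newFiles.clear(); }
            if (ts == newLatest) { newFiles.add(e.getKey()); }
        }
        latestTimestamp = newLatest;
        filesAtLatestTimestamp = newFiles;
        return toEmit;
    }
}
```

Note that files with a timestamp strictly older than the stored one are
skipped without any lookup, which is why only the newest timestamp's file
names ever need to be remembered.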

Now, with ListFile it will be a bit different. We tend to think of GetFile
and List/Fetch File as pulling from a local file system. However, they are
also certainly used to pull from a network-mounted file system. In that case,
all nodes in the cluster need the ability to pull the data in unison, so we
will want to save the state in such a way that all nodes in the cluster have
access to it, in case the primary node changes. But if the file system is
local, we don't want to save state across the cluster, because each node
needs its own state. So that would likely just be an extra property on the
processor.
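For that extra property, one hedged sketch (all names here are hypothetical,
not existing NiFi API) is a user-facing value that selects the state scope; in
a real processor this would be a PropertyDescriptor with two allowable values,
modeled below as a plain enum and parser:

```java
// Hypothetical sketch of an extra processor property choosing where listing
// state is kept. "local" means per-node state (the directory is local to
// each node); "cluster" means shared state (the directory is network-mounted
// and all nodes must coordinate through it).
enum StateScope {
    LOCAL,
    DISTRIBUTED
}

class StateScopeProperty {
    static StateScope parse(String propertyValue) {
        switch (propertyValue.toLowerCase()) {
            case "local":   return StateScope.LOCAL;
            case "cluster": return StateScope.DISTRIBUTED;
            default: throw new IllegalArgumentException(
                "Unknown state scope: " + propertyValue);
        }
    }
}
```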

If saving state locally, it's easy enough to just write to a text file (I
recommend you allow the user to specify the state file and default it to
conf/ListFile-<processor id>.state or something like that).
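A minimal sketch of that local persistence (hypothetical names; the
write-then-rename step is my own defensive addition, not something specified
above):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of a local state file for ListFile: the first line is
// the latest last-modified timestamp, and each remaining line is a file name
// sharing that timestamp. The default location would be something like
// conf/ListFile-<processor id>.state.
class LocalStateFile {
    static void write(Path stateFile, long latestTimestamp,
                      Collection<String> filesAtThatTimestamp) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(Long.toString(latestTimestamp));
        lines.addAll(filesAtThatTimestamp);
        Files.createDirectories(stateFile.getParent());
        // Write to a temp file and rename, so a crash mid-write can never
        // leave a truncated state file behind.
        Path tmp = stateFile.resolveSibling(stateFile.getFileName() + ".tmp");
        Files.write(tmp, lines);
        Files.move(tmp, stateFile,
            StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }

    static long readTimestamp(Path stateFile) throws IOException {
        if (!Files.exists(stateFile)) {
            return -1L;  // no state yet: first run lists everything
        }
        return Long.parseLong(Files.readAllLines(stateFile).get(0));
    }
}
```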

I have not documented this pattern, specifically because we've been talking
for a while about implementing the Simple State Management feature but just
haven't gotten there yet. I expected that we would have that finished before
writing many more of these List/Fetch processors. That will radically change
how we handle all of this.

But since it is not there... it may actually make sense to just refactor the 
ListHDFS processor into an AbstractListFileProcessor
that is responsible for handling the state management. I am not sure how 
complicated that would get, though. Just a
thought.

Hopefully this helped to clear things up, rather than muddy them up :) Feel 
free to fire back any questions.

Thanks
-Mark


----------------------------------------
> Date: Wed, 29 Jul 2015 06:42:39 -0400
> Subject: Re: Fetch change list
> From: [email protected]
> To: [email protected]
>
> JoeS
>
> Sounds great. I'd ignore my provenance comment as that was really
> more about how something external could keep tabs on progress, etc..
> Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
> for the good bits. But the logic to follow for saving state you'll
> want is probably the same.
>
> Mark - do you have the design of that thing documented anywhere? It
> is a good pattern to describe because it is effectively a model for
> taking non-scaleable dataflow interfaces and making them behave as if
> they were.
>
> Thanks
> JoeW
>
> On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <[email protected]> wrote:
>> Joe,
>>
>> I'm interested in working on List/FetchFile. It seems like starting with
>> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
>> I'll look at List/FetchHDFS, but is there any further detail on how this
>> functionality should differ from GetFile? As for keeping state,
>> provenance was suggested, a separate state folder might work, or some file
>> systems support additional state that might be usable.
>>
>> Regards,
>> Joe
>>
>> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <[email protected]> wrote:
>>
>>> Anup,
>>>
>>> The two tickets in question appear to be:
>>> https://issues.apache.org/jira/browse/NIFI-631
>>> https://issues.apache.org/jira/browse/NIFI-673
>>>
>>> Neither have been claimed as of yet. Anybody interested in taking one
>>> or both of these on? It would be a lot like List/Fetch HDFS so you'll
>>> have good examples to work from.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
>>> <[email protected]> wrote:
>>>> Can I expect this functionality in the upcoming releases of Nifi ?
>>>>
>>>> On 13/07/15 9:13 am, "Sethuram, Anup" <[email protected]> wrote:
>>>>
>>>>>Where is this 1TB dataset living today?
>>>>>[anup] Resides in a filesystem
>>>>>
>>>>>- What is the current nature of the dataset? Is it already in large
>>>>>bundles as files or is it a series of tiny messages, etc..? Does it
>>>>>need to be split/merged/etc..
>>>>>[anup] Archived files of size 3MB each collected over a period. Directory
>>>>>(1TB) -> Sub-Directories -> Files
>>>>>
>>>>>- What is the format of the data? Is it something that can easily be
>>>>>split/merged or will it require special processes to do so?
>>>>>[anup] zip, tar formats.
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>View this message in context:
>>>>>http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p2126.html
>>>>>Sent from the Apache NiFi (incubating) Developer List mailing list
>>>>>archive at Nabble.com.
>>>>>
>>>>>________________________________
>>>>>The information contained in this message may be confidential and legally
>>>>>protected under applicable law. The message is intended solely for the
>>>>>addressee(s). If you are not the intended recipient, you are hereby
>>>>>notified that any use, forwarding, dissemination, or reproduction of this
>>>>>message is strictly prohibited and may be unlawful. If you are not the
>>>>>intended recipient, please contact the sender by return e-mail and
>>>>>destroy all copies of the original message.
>>>>
>>>>
>>>
                                          
