Re: Fetch change list

Adam Taft Wed, 29 Jul 2015 10:12:04 -0700

1)  Also, file permissions may not necessarily allow a write lock on a
file.  The NiFi user might have only read permission on a given file.
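
To make the concern concrete, here is a minimal sketch of a "try to lock for write" probe using plain java.nio.  The class and method names (LockProbe, probe) are hypothetical and not part of NiFi; it only illustrates that the probe has a third outcome, "not writable at all", which says nothing about whether another writer is still busy with the file.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not NiFi API: probes whether a write lock can be taken.
final class LockProbe {
    enum Result { LOCKABLE, LOCKED, NOT_WRITABLE }

    static Result probe(Path file) {
        // A read-only (or missing) file can never be opened for write, so the
        // probe cannot tell us anything about other writers in that case.
        if (!Files.isWritable(file)) {
            return Result.NOT_WRITABLE;
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // tryLock() returns null when another process holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return Result.LOCKED;
            }
            lock.release();
            return Result.LOCKABLE;
        } catch (OverlappingFileLockException e) {
            // The lock is already held elsewhere inside this same JVM.
            return Result.LOCKED;
        } catch (IOException e) {
            // Includes AccessDeniedException raised while opening for write.
            return Result.NOT_WRITABLE;
        }
    }
}
```

Even when the result is LOCKABLE, that only helps if the writing application actually takes locks, which (as Joe notes below) many do not.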


2)  I really like this concept, +1 to the idea.  In this way, the "List"
operation follows the Unix design philosophy of doing exactly one
thing.  Conceptually, sitting between the "List" and "Fetch" operations
could be a handful of standard processors designed to filter, augment, or
ignore any fetch request.  This could be a very powerful way to compose the
functionality (though possibly at the expense of simplicity for the
dataflow manager).
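
A rough sketch of that composition, in plain Java rather than real NiFi processors.  Everything here (ListFilterFetchSketch, Listed, notDotFile, sizeStable, select) is a hypothetical name chosen for illustration: the "List" step emits entries, a chain of small filters decides, and only the survivors would be handed to "Fetch".  The two filters are the dot file convention and the "same size in two consecutive listings" check mentioned earlier in the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; these are not NiFi classes.
final class ListFilterFetchSketch {

    /** One entry produced by the "List" step: a name and the size seen at listing time. */
    static final class Listed {
        final String name;
        final long size;

        Listed(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    /** Rejects names that follow the "dot file" in-progress convention. */
    static Predicate<Listed> notDotFile() {
        return f -> !f.name.startsWith(".");
    }

    /** Stateful filter: accepts a file only if its size equals the size seen in the previous listing. */
    static Predicate<Listed> sizeStable(Map<String, Long> previousSizes) {
        return f -> {
            Long before = previousSizes.put(f.name, f.size);
            return before != null && before == f.size;
        };
    }

    /** The "decision" step: keeps only entries that pass every filter, in order. */
    static List<Listed> select(List<Listed> listing, List<Predicate<Listed>> filters) {
        List<Listed> ready = new ArrayList<>();
        for (Listed f : listing) {
            boolean keep = true;
            for (Predicate<Listed> filter : filters) {
                if (!filter.test(f)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                ready.add(f);
            }
        }
        return ready;
    }
}
```

Adding a new decision rule is then just one more predicate in the list.  Note that this sketch does not remember what has already been fetched, so a stable file is selected on every later pass; that bookkeeping is the separate state question discussed below.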



On Wed, Jul 29, 2015 at 1:00 PM, Joe Witt <[email protected]> wrote:

> On 1) there are very few guarantees across OSes.  Some support locking but
> many apps don't use it.  File IO is the wild, wild west of idioms.
>
> On 2) you certainly can tackle it that way.  This gets into the more art
> than science part of designing and composing processors.  The key is to always
> keep the operations person's perspective in mind as the user.
>
> Joe
> On Jul 29, 2015 9:25 AM, "Joe Skora" <[email protected]> wrote:
>
> > 1. Is there any reason it wouldn't work to try to open the files for write
> > and only begin to handle a file when it is writable?  It seems like a file
> > source would typically open for write, write everything, and then close.
> > Cases where something re-opens and appends would obviously not work in that
> > case, but that seems a less likely situation.
> >
> > 2. Is there any value in breaking it into 3 phases, with a "selection"
> > phase, "decision" phase, and "handling" phase?  The "selection" phase
> > lists ALL possible files to be considered, the "decision" phase determines
> > which files to process, and the "handling" phase manages processing the
> > selected files.  Processors in the "decision" phase provide the
> > "combination of signals" Adam mentions, using whatever variety of state
> > and other factors is necessary.  Extending the decision logic only requires
> > a new processor.  Obviously, there's still a bit of back-and-forth among
> > the phases that would have to be worked out for managing file removal, etc.
> >
> > Joe
> >
> > On Wed, Jul 29, 2015 at 10:31 AM, Joe Witt <[email protected]> wrote:
> >
> > > Turning noatime on kicks last mod out the window.  It is for sure the
> > > case when dealing with file IO that there really are no rules.  As
> > > Adam notes it is about giving options/strategies.
> > >
> > > Surprisingly hard to do this well.  But good MVP options exist to get
> > > something out and get more feedback on true need.
> > >
> > > On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <[email protected]> wrote:
> > > > Some additional feature requests for sake of consideration...
> > > >
> > > > > For some file systems (I can think of one), the last modified date may
> > > > > not be dependable or may not have high enough precision.  Additional
> > > > > strategies could be considered for determining whether a file has been
> > > > > previously processed.  For example, the byte size of the file, or the
> > > > > MD5 hash, or possibly other signals.
> > > >
> > > > > While these additional strategies may not be coded initially, I think
> > > > > they would add nice features for the proposed AbstractListFileProcessor.
> > > > > In this way, the abstract processor could use one or even a combination
> > > > > of signals to determine if a file has been modified and needs to be
> > > > > pulled again.
> > > >
> > > > > Additionally, it might be good to have other mechanisms in place to mark
> > > > > a file as unavailable.  The "dot file" convention is pretty common, but
> > > > > there might be additional ways to indicate that a file is still being
> > > > > manipulated, i.e. maybe not all writers to the file system understand
> > > > > the dot file convention, and so other strategies might be required.
> > > >
> > > > > For example, in one processor I worked with, it was required to pull the
> > > > > list of remote files twice in order to monitor the file sizes.  If the
> > > > > file size stayed consistent between two pulls, it could safely be
> > > > > considered ready for processing.  However, if the file size differed in
> > > > > the two pulls, we could assume that a client was still writing to the
> > > > > file.
> > > >
> > > > Adam
> > > >
> > > >
> > > > On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <[email protected]>
> > > wrote:
> > > >
> > > >> Joe S,
> > > >>
> > > > >> I agree, I think the design of List/Fetch HDFS is extremely applicable
> > > > >> to this. The way it saves state is by using a DistributedMapCacheServer.
> > > > >> The intent is to run the List processor on primary node only, and it
> > > > >> will store its state there so that if the primary node is changed, any
> > > > >> other node can pick up where the last one left off. In order to avoid
> > > > >> saving a massive amount of state in memory, it stores the timestamp of
> > > > >> the latest file that it has fetched, as well as all files that have that
> > > > >> same timestamp (timestamp = last modified date in this case). So the
> > > > >> next time it runs, it can pull just things whose lastModifiedDate is
> > > > >> later than or equal to that timestamp, but it can still know which
> > > > >> things to avoid pulling twice because we've saved that info as well.
> > > >>
> > > > >> Now, with ListFile it will be a bit different. We tend to think of
> > > > >> GetFile and List/Fetch File as pulling from a local file system.
> > > > >> However, it is also certainly used to pull from a network-mounted file
> > > > >> system. In this case, all nodes in the cluster need the ability to pull
> > > > >> the data in unison. So in this case, we will want to save the state in
> > > > >> such a way that all nodes in the cluster have access to it, in case the
> > > > >> primary node changes. But if the file is local, we don't want to save
> > > > >> state across the cluster, because each node needs its own state. So that
> > > > >> would likely just be an extra property on the processor.
> > > >>
> > > > >> If saving state locally, it's easy enough to just write to a text file
> > > > >> (recommend you allow the user to specify the state file and default it
> > > > >> to conf/ListFile-<processor id>.state or something like that).
> > > >>
> > > > >> I have not documented this pattern, specifically because we've been
> > > > >> talking for a while about implementing the Simple State Management but
> > > > >> we just haven't gotten there yet. I expected that we would have that
> > > > >> finished before writing many more of these List/Fetch processors. That
> > > > >> will radically change how we handle all of this.
> > > >>
> > > > >> But since it is not there... it may actually make sense to just refactor
> > > > >> the ListHDFS processor into an AbstractListFileProcessor that is
> > > > >> responsible for handling the state management. I am not sure how
> > > > >> complicated that would get, though. Just a thought.
> > > >>
> > > > >> Hopefully this helped to clear things up, rather than muddy them up :)
> > > > >> Feel free to fire back any questions.
> > > >>
> > > >> Thanks
> > > >> -Mark
> > > >>
> > > >>
> > > >> ----------------------------------------
> > > >> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> > > >> > Subject: Re: Fetch change list
> > > >> > From: [email protected]
> > > >> > To: [email protected]
> > > >> >
> > > >> > JoeS
> > > >> >
> > > > >> > Sounds great. I'd ignore my provenance comment as that was really
> > > > >> > more about how something external could keep tabs on progress, etc.
> > > > >> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to
> > > > >> > him for the good bits. But the logic to follow for saving state
> > > > >> > you'll want is probably the same.
> > > > >> >
> > > > >> > Mark - do you have the design of that thing documented anywhere? It
> > > > >> > is a good pattern to describe because it is effectively a model for
> > > > >> > taking non-scalable dataflow interfaces and making them behave as if
> > > > >> > they were.
> > > >> >
> > > >> > Thanks
> > > >> > JoeW
> > > >> >
> > > >> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <[email protected]>
> > wrote:
> > > >> >> Joe,
> > > >> >>
> > > > >> >> I'm interested in working on List/FetchFile. It seems like starting
> > > > >> >> with [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes
> > > > >> >> sense. I'll look at List/FetchHDFS, but is there any further detail
> > > > >> >> on how this functionality should differ from GetFile? As for keeping
> > > > >> >> state, provenance was suggested, a separate state folder might work,
> > > > >> >> or some file systems support additional state that might be usable.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Joe
> > > >> >>
> > > >> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <[email protected]>
> > > wrote:
> > > >> >>
> > > >> >>> Anup,
> > > >> >>>
> > > >> >>> The two tickets in question appear to be:
> > > >> >>> https://issues.apache.org/jira/browse/NIFI-631
> > > >> >>> https://issues.apache.org/jira/browse/NIFI-673
> > > >> >>>
> > > > >> >>> Neither has been claimed as of yet. Anybody interested in taking
> > > > >> >>> one or both of these on? It would be a lot like List/Fetch HDFS so
> > > > >> >>> you'll have good examples to work from.
> > > >> >>>
> > > >> >>> Thanks
> > > >> >>> Joe
> > > >> >>>
> > > >> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> > > >> >>> <[email protected]> wrote:
> > > > >> >>>> Can I expect this functionality in the upcoming releases of NiFi?
> > > >> >>>>
> > > >> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <
> [email protected]
> > >
> > > >> wrote:
> > > >> >>>>
> > > >> >>>>>Where is this 1TB dataset living today?
> > > >> >>>>>[anup] Resides in a filesystem
> > > >> >>>>>
> > > > >> >>>>>- What is the current nature of the dataset? Is it already in
> > > > >> >>>>>large bundles as files or is it a series of tiny messages, etc.?
> > > > >> >>>>>Does it need to be split/merged/etc.?
> > > > >> >>>>>[anup] Archived files of size 3MB each collected over a period.
> > > > >> >>>>>Directory (1TB) -> Sub-Directories -> Files
> > > >> >>>>>
> > > > >> >>>>>- What is the format of the data? Is it something that can easily
> > > > >> >>>>>be split/merged or will it require special processes to do so?
> > > > >> >>>>>[anup] zip, tar formats.
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>--
> > > >> >>>>>View this message in context:
> > > >> >>>>>
> > > >> >>>
> > > >>
> > >
> >
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> > > >> >>>>>nge-list-tp1351p2126.html
> > > >> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing
> > list
> > > >> >>>>>archive at Nabble.com.
> > > >> >>>>>
> > > >> >>>>>________________________________
> > > >> >>>>>The information contained in this message may be confidential
> and
> > > >> legally
> > > >> >>>>>protected under applicable law. The message is intended solely
> > for
> > > the
> > > >> >>>>>addressee(s). If you are not the intended recipient, you are
> > hereby
> > > >> >>>>>notified that any use, forwarding, dissemination, or
> reproduction
> > > of
> > > >> this
> > > >> >>>>>message is strictly prohibited and may be unlawful. If you are
> > not
> > > the
> > > >> >>>>>intended recipient, please contact the sender by return e-mail
> > and
> > > >> >>>>>destroy all copies of the original message.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>
> > > >>
> > > >>
> > >
> >
>
