1) Also, file permissions may not allow a write lock on a file. The NiFi user might have only read permission on a given file.
2) I really like this concept, +1 to the idea. In this way, the "List" operation follows the Unix design philosophy of doing exactly one thing. Conceptually, a handful of standard processors could sit between the "List" and "Fetch" operations to filter, augment, or ignore any fetch request. This could be a very powerful way to compose the functionality (though possibly at the expense of simplicity for the dataflow manager).

On Wed, Jul 29, 2015 at 1:00 PM, Joe Witt <[email protected]> wrote:

> On 1) there are very few guarantees across OSes. Some support locking but
> many apps don't use it. File IO is the wild wild west of idioms.
>
> On 2) you certainly can tackle it that way. This gets into the more art
> than science part of designing and composing processors. Key is to always
> keep the operations person perspective in mind as the user.
>
> Joe
> On Jul 29, 2015 9:25 AM, "Joe Skora" <[email protected]> wrote:
>
> > 1. Is there any reason it wouldn't work to try to open the files for write
> > and only begin to handle it when it is writable? It seems like a file
> > source would typically open for write, write everything, and then close.
> > Cases where something re-opens and appends would obviously not work in that
> > case, but that seems a less likely situation.
> >
> > 2. Is there any value in breaking it into 3 phases, with a "selection"
> > phase, "decision" phase, and "handling" phase? The "selection" phase
> > lists ALL possible files to be considered, the "decision" phase determines
> > which files to process, and the "handling" phase manages processing the
> > selected files. Processors in the "decision" phase provide the "combination of
> > signals" Adam mentions, using whatever variety of state and other factors
> > necessary. Extending the decision logic only requires a new processor.
> > Obviously, there's still a bit of back-and-forth among the phases that would
> > have to be worked out for managing file removal, etc.
> >
> > Joe
> >
> > On Wed, Jul 29, 2015 at 10:31 AM, Joe Witt <[email protected]> wrote:
> >
> > > Turning noatime on kicks last mod out the window. It is for sure the
> > > case when dealing with file IO that there really are no rules. As
> > > Adam notes it is about giving options/strategies.
> > >
> > > Surprisingly hard to do this well. But good MVP options exist to get
> > > something out and get more feedback on true need.
> > >
> > > On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <[email protected]> wrote:
> > > > Some additional feature requests for sake of consideration...
> > > >
> > > > For some file systems (I can think of one), the last modified date may not
> > > > be dependable or possibly not high enough precision. Additional strategies
> > > > could be considered for determining whether a file has been previously
> > > > processed. For example, the byte size of the file, or the md5 hash, or
> > > > possibly other signals.
> > > >
> > > > While these additional strategies may not be coded initially, I think they
> > > > would add nice features for the proposed AbstractListFileProcessor. In
> > > > this way, the abstract processor could use one or even a combination of
> > > > signals to determine if a file has been modified and needs to be pulled
> > > > again.
> > > >
> > > > Additionally, it might be good to have other mechanisms in place to mark a
> > > > file as unavailable. The "dot file" convention is pretty common, but there
> > > > might be additional ways to indicate that a file is still being
> > > > manipulated, i.e. maybe not all writers to the file system understand the
> > > > dot file convention, and so other strategies might be required.
> > > >
> > > > For example, in one processor I worked with, it was required to pull the
> > > > list of remote files twice in order to monitor the file sizes.
> > > > If the file
> > > > size stayed consistent between two pulls, it could safely be considered
> > > > ready for processing. However, if the file size differed in the two pulls,
> > > > we could assume that a client was still writing to the file.
> > > >
> > > > Adam
> > > >
> > > >
> > > > On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <[email protected]> wrote:
> > > >
> > > >> Joe S,
> > > >>
> > > >> I agree, I think the design of List/Fetch HDFS is extremely applicable to
> > > >> this. The way it saves state is by using a DistributedMapCacheServer.
> > > >> The intent is to run the List processor on primary node only, and it
> > > >> will store its state there so that if the primary node is changed, any
> > > >> other node can pick up where the last one left off. In order to avoid
> > > >> saving a massive amount of state in memory, it stores the timestamp of
> > > >> the latest file that it has fetched, as well as all files that have that
> > > >> same timestamp (timestamp = last modified date in this case). So the
> > > >> next time it runs, it can pull just things whose lastModifiedDate is
> > > >> later than or equal to that timestamp, but it can still know which
> > > >> things to avoid pulling twice because we've saved that info as well.
> > > >>
> > > >> Now, with ListFile it will be a bit different. We tend to think of GetFile
> > > >> and List/Fetch File as pulling from a local file system. However, it is
> > > >> also certainly used to pull from a network-mounted file system. In this
> > > >> case, all nodes in the cluster need the ability to pull the data in
> > > >> unison. So in this case, we will want to save the state in such a way
> > > >> that all nodes in the cluster have access to it, in case the primary node
> > > >> changes.
> > > >> But if the file is local, we don't want to save state across the
> > > >> cluster, because each node needs its own state. So that would likely
> > > >> just be an extra property on the processor.
> > > >>
> > > >> If saving state locally, it's easy enough to just write to a text file
> > > >> (recommend you allow user to specify the state file and default it to
> > > >> conf/ListFile-<processor id>.state or something like that).
> > > >>
> > > >> I have not documented this pattern. Specifically because we've been
> > > >> talking for a while about implementing the Simple State Management but
> > > >> we just haven't gotten there yet. I expected that we would have that
> > > >> finished before writing many more of these List/Fetch processors. That
> > > >> will radically change how we handle all of this.
> > > >>
> > > >> But since it is not there... it may actually make sense to just refactor
> > > >> the ListHDFS processor into an AbstractListFileProcessor that is
> > > >> responsible for handling the state management. I am not sure how
> > > >> complicated that would get, though. Just a thought.
> > > >>
> > > >> Hopefully this helped to clear things up, rather than muddy them up :)
> > > >> Feel free to fire back any questions.
> > > >>
> > > >> Thanks
> > > >> -Mark
> > > >>
> > > >>
> > > >> ----------------------------------------
> > > >> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> > > >> > Subject: Re: Fetch change list
> > > >> > From: [email protected]
> > > >> > To: [email protected]
> > > >> >
> > > >> > JoeS
> > > >> >
> > > >> > Sounds great. I'd ignore my provenance comment as that was really
> > > >> > more about how something external could keep tabs on progress, etc..
> > > >> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
> > > >> > for the good bits. But the logic to follow for saving state you'll
> > > >> > want is probably the same.
> > > >> >
> > > >> > Mark - do you have the design of that thing documented anywhere? It
> > > >> > is a good pattern to describe because it is effectively a model for
> > > >> > taking non-scaleable dataflow interfaces and making them behave as if
> > > >> > they were.
> > > >> >
> > > >> > Thanks
> > > >> > JoeW
> > > >> >
> > > >> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <[email protected]> wrote:
> > > >> >> Joe,
> > > >> >>
> > > >> >> I'm interested in working on List/FetchFile. It seems like starting with
> > > >> >> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
> > > >> >> I'll look at List/FetchHDFS, but is there any further detail on how this
> > > >> >> functionality should differ from GetFile? As for keeping state,
> > > >> >> provenance was suggested, a separate state folder might work, or some file
> > > >> >> systems support additional state that might be usable.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Joe
> > > >> >>
> > > >> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <[email protected]> wrote:
> > > >> >>
> > > >> >>> Anup,
> > > >> >>>
> > > >> >>> The two tickets in question appear to be:
> > > >> >>> https://issues.apache.org/jira/browse/NIFI-631
> > > >> >>> https://issues.apache.org/jira/browse/NIFI-673
> > > >> >>>
> > > >> >>> Neither have been claimed as of yet. Anybody interested in taking one
> > > >> >>> or both of these on? It would be a lot like List/Fetch HDFS so you'll
> > > >> >>> have good examples to work from.
> > > >> >>>
> > > >> >>> Thanks
> > > >> >>> Joe
> > > >> >>>
> > > >> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> > > >> >>> <[email protected]> wrote:
> > > >> >>>> Can I expect this functionality in the upcoming releases of Nifi ?
> > > >> >>>>
> > > >> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <[email protected]> wrote:
> > > >> >>>>
> > > >> >>>>>Where is this 1TB dataset living today?
> > > >> >>>>>[anup] Resides in a filesystem
> > > >> >>>>>
> > > >> >>>>>- What is the current nature of the dataset? Is it already in large
> > > >> >>>>>bundles as files or is it a series of tiny messages, etc..? Does it
> > > >> >>>>>need to be split/merged/etc..
> > > >> >>>>>[anup] Archived files of size 3MB each collected over a period. Directory
> > > >> >>>>>(1TB) -> Sub-Directories -> Files
> > > >> >>>>>
> > > >> >>>>>- What is the format of the data? Is it something that can easily be
> > > >> >>>>>split/merged or will it require special processes to do so?
> > > >> >>>>>[anup] zip, tar formats.
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>--
> > > >> >>>>>View this message in context:
> > > >> >>>>>http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p2126.html
> > > >> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing list
> > > >> >>>>>archive at Nabble.com.
> > > >> >>>>>
> > > >> >>>>>________________________________
> > > >> >>>>>The information contained in this message may be confidential and legally
> > > >> >>>>>protected under applicable law. The message is intended solely for the
> > > >> >>>>>addressee(s). If you are not the intended recipient, you are hereby
> > > >> >>>>>notified that any use, forwarding, dissemination, or reproduction of this
> > > >> >>>>>message is strictly prohibited and may be unlawful. If you are not the
> > > >> >>>>>intended recipient, please contact the sender by return e-mail and
> > > >> >>>>>destroy all copies of the original message.
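For anyone skimming the thread, here is a rough sketch of the state tracking Mark describes for ListHDFS: remember the newest last-modified timestamp, plus every path already listed at exactly that timestamp, and on the next run pick up anything at or after that timestamp that has not been seen. This is not the NiFi code. `ListingState` and `select_new_entries` are hypothetical names, and timestamps are assumed to be plain integers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ListingState:
    """Hypothetical state record: newest timestamp seen so far, plus the
    paths already listed at exactly that timestamp."""
    latest_timestamp: int = -1
    paths_at_latest: Set[str] = field(default_factory=set)


def select_new_entries(
    listing: Dict[str, int], state: ListingState
) -> Tuple[List[str], ListingState]:
    """Return the paths that have not been listed yet and the updated state.

    `listing` maps path -> last-modified time (assumed to be an integer).
    """
    new_paths = sorted(
        path
        for path, mtime in listing.items()
        if mtime > state.latest_timestamp
        or (mtime == state.latest_timestamp
            and path not in state.paths_at_latest)
    )
    if not new_paths:
        return [], state

    newest = max(listing[path] for path in new_paths)
    if newest == state.latest_timestamp:
        # Only more files at the same timestamp: extend the remembered set.
        remembered = state.paths_at_latest | set(new_paths)
    else:
        # A newer timestamp appeared: only paths at that timestamp matter now.
        remembered = {path for path in new_paths if listing[path] == newest}
    return new_paths, ListingState(newest, remembered)
```

Only one timestamp and a small set of paths are kept, which is the point Mark makes about not holding a massive amount of state in memory. Where that record is stored (a shared cache versus a local state file) is a separate choice.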

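The "pull the listing twice and compare sizes" check Adam mentions could look roughly like the following. It is only a sketch under assumptions: `stable_files` is a hypothetical helper, and each listing is assumed to be a plain mapping of path to size in bytes, however it was obtained.

```python
from typing import Dict, List


def stable_files(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Return paths present in both listings whose size did not change.

    A file missing from the earlier listing, or whose size differs between
    the two listings, is treated as possibly still being written.
    """
    return sorted(
        path for path, size in current.items() if previous.get(path) == size
    )


# Example: b.dat grew between listings and c.dat is new, so only a.dat is ready.
earlier = {"a.dat": 10, "b.dat": 20}
later = {"a.dat": 10, "b.dat": 25, "c.dat": 5}
ready = stable_files(earlier, later)
```

The interval between the two listings is a tuning decision. A slow writer could leave the size unchanged across both pulls, so this is a heuristic and not a guarantee.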