I think Joe's perspective maps more closely to what Andre was looking for, namely knowing when a consumer can be notified of (and guaranteed) a successful handoff of data in the overall flow. The key factor is that this mechanism provides at-least-once delivery: the unit of work for accepting the data completes (commits) before the acknowledgement is sent to complete the round trip of the transaction. Any speed bumps after that commit could keep the acknowledgement from making it back to your producer. This ties into Oleg's point about catastrophic failure, since unfortunate timing could cause data duplication, as highlighted in the developer guide. Regardless, the data is captured in the content repository and enjoys the same copy-on-write/pass-by-reference semantics that underpin a lot of NiFi's performance.
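To make the ordering concrete, here is a rough toy model of commit-before-acknowledge. It is only a sketch: `Receiver`, `Producer`, `send_next` and the `ack_lost` flag are names I made up for illustration, and none of it is NiFi or lumberjack API. Python is used purely for brevity.

```python
# Toy model of commit-before-acknowledge (at-least-once delivery).
# All names here are hypothetical; this is not NiFi or lumberjack API.

class Receiver:
    """Stands in for the listening processor plus the content repository."""

    def __init__(self):
        self.committed = []  # data the receiver has taken ownership of

    def receive(self, batch, ack_lost=False):
        # 1. Commit first: ownership transfers at this point.
        self.committed.append(batch)
        # 2. Only then acknowledge. The ack can still be lost on the way back.
        return not ack_lost


class Producer:
    """Stands in for an agent tailing a file and tracking an offset."""

    def __init__(self, lines):
        self.lines = lines
        self.offset = 0

    def send_next(self, receiver, ack_lost=False):
        batch = self.lines[self.offset]
        acked = receiver.receive(batch, ack_lost=ack_lost)
        if acked:
            self.offset += 1  # advance only after an acknowledgement arrives
        return acked


receiver = Receiver()
producer = Producer(["line-1", "line-2"])

producer.send_next(receiver, ack_lost=True)  # committed, but the ack never arrives
producer.send_next(receiver)                 # producer resends line-1
producer.send_next(receiver)                 # line-2 goes through normally

print(receiver.committed)
print(producer.offset)
```

In this run the receiver ends up holding line-1 twice, because the commit succeeded while the acknowledgement did not. Nothing is lost, but a duplicate is possible, which is the at-least-once trade-off.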
Oleg's first point picks up at the juncture where data has moved beyond the initial consumption outlined by yourself and above, detailing the process and how it ties into the content repository's key features. While that data gets streamed in by the consumer and enters the purview of NiFi, ownership does not occur until the aforementioned commit. If exactly-once semantics are important for a particular application, there are ways of greatly aiding that process, such as using DetectDuplicate driven by a background cache. After that commit, that particular file could have many different paths and ways in which it is processed, with varying outcomes.

Awesome to hear you are continuing work on extending the capabilities, and we look forward to aiding further in your contribution. This is an excellent question to be mindful of as a responsible producer ensuring data delivery.

On Tue, Dec 8, 2015 at 5:00 AM, Oleg Zhurakousky <ozhurakou...@hortonworks.com> wrote:

> At the high level we try not to copy anything unless we have to, so when
> you say “under NiFi care” it becomes a bit unclear. For example, one may be
> copying a file using a zero-copy algorithm. Let’s assume that NiFi was the
> facilitator of that process. With that said, the data is/was never under
> NiFi management because nothing was read into memory to perform the copy.
> Now, even if something is read in memory, what does it really mean from your
> perspective? Technically one may argue that the ‘record’ is now under NiFi
> management and it could be acknowledged. But what if somewhere downstream
> the processing of this record fails?
>
> Basically, IMHO your question is about Transactional capabilities where
> transaction implies that acknowledgment will be provided *only* when a
> record is fully processed and its re-processing may never happen again with
> the exception of catastrophic failures.
> If so, given the asynchronous nature of NiFi, it may not be a
> straightforward process, albeit doable.
>
> But before we get to that, let us know if my ramblings above are not
> totally off ;).
>
> Cheers
> Oleg
>
>
> On Dec 8, 2015, at 3:07 AM, Andre <andre-li...@fucs.org> wrote:
> >
> > All,
> >
> > Still working on the lumberjack processor. Data is currently being decoded,
> > SSL is sort of working but before I start wrapping up I wanted to confirm:
> >
> > Lumberjack is a protocol that includes the dispatch of an acknowledgement
> > message to the producing agent.
> >
> > As a consequence, usually a producer tailing a file will only update its
> > offset AFTER receiving the acknowledgement from the lumberjack endpoint.
> >
> > Ideally this acknowledgement should only be sent after data is no longer in
> > the processor memory buffers and the chances of memory loss are restricted
> > to catastrophic failure.
> >
> > Which leads to my question: From a development point of view, at what stage
> > is data assumed to be under NiFi's care?
> >
> > I thank you in advance.
> >