Devin,

So the session follows the unit of work pattern that Martin Fowler
describes [1].  The idea is that whether you access one flow file,
many flow files, or anything in between, you are doing a single
session of work.  You can commit everything you do in that session or
roll back everything you do in that session.

This pattern/concept provides a nice and safe environment in which
the contract between the developer and the framework is well
understood.  Generally, rollback is only necessary when some
unplanned or unexpected exception occurs, and it is largely there so
that the framework can ensure all things are returned to a safe
state.  The framework can also do things like penalize that
processor/extension so that, if the cause is some programming error,
its impact on the system overall is reduced.
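To make the contract concrete, here's a rough sketch of the shape
this takes in a processor built on AbstractProcessor (the class name,
the relationship, and the work in the middle are all made up for
illustration, not taken from your processor):

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class ExampleProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Processed flow files").build();

    // (getRelationships/property descriptors trimmed for brevity)

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;  // nothing to do on this trigger
        }
        try {
            // ... work against the flow file: read/write content, attributes ...
            session.transfer(flowFile, REL_SUCCESS);
            // AbstractProcessor commits the session for us on return
        } catch (final Exception e) {
            // the unplanned/unexpected case: throwing causes the framework
            // to roll the whole session back and penalize, returning
            // everything to a safe state
            throw new ProcessException("Unexpected failure", e);
        }
    }
}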

So, with that said, there are two ways to think about your case.  It
appears you are doing your own batching, probably for higher
throughput, and it also appears you'd really like to treat each flow
file independently in terms of logic/handling.

This is precisely why, in addition to this nice clean unit of work
pattern, we also support automated session batching (this is what
Andrew was referring to).  In this mode you add an annotation called
@SupportsBatching to your processor, which signals to the framework
that it may automatically combine subsequent calls to commit and
commit them as a single batch.  In this way you can build your
processor in a very simple single-flow-file sort of manner and call
commit, but the framework will combine a series of commits within a
very small time window to get higher throughput.  In the UI a user
can signal their willingness to let the framework do this and
acknowledge that they may be trading off some small latency in favor
of higher throughput.

Now there are some additional things to think about when using this.
For instance, it is best used when the processor and its function are
side-effect free, meaning that there are no external system state
changes or things like that.  In this sense you can think of the
processor you're building as idempotent (as the REST folks like to
say).  If your processor fits that, then @SupportsBatching can have
really powerful results.
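In code, that opt-in is just the annotation; the rest of the
processor stays in the simple one-flow-file shape (again a made-up
sketch, with the relationship and transform just placeholders):

import org.apache.nifi.annotation.behavior.SupportsBatching;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

// Signals that the framework may coalesce this processor's commits
// into small batches when the user opts in via the Run Duration
// slider in the UI.
@SupportsBatching
public class SimpleTransformProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Transformed flow files").build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();  // one flow file at a time
        if (flowFile == null) {
            return;
        }
        // ... a side-effect-free transform goes here ...
        session.transfer(flowFile, REL_SUCCESS);
        // the commit that follows may be batched with its neighbors
    }
}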

Now, you also mention that some flow files you'd consider failures
and others you'd consider something else, presumably success.  This
is perfect and very common and does not require a rollback.  Keep
rollback in mind for the bad stuff that can happen that you don't
plan for [2].  For failures that you can predict, such as invalid
data or the invalid state of something, you actually want to have a
failure relationship on that processor and simply route things there.
From a 'developer' perspective this is not a rollback case.
"Failure" then is as planned for and expected as "success".  So you
go ahead and route the flow file to failure and call commit.  All
good.  It is the person designing this loosely coupled and highly
cohesive set of components together in a flow who gets to decide what
failure means for their context.
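For your case that might look roughly like this (the isValid check is
entirely hypothetical, and the discard.reason attribute name is taken
from your note):

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

public class ValidatingProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Valid flow files").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("Predictably bad flow files").build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        if (isValid(flowFile)) {
            session.transfer(flowFile, REL_SUCCESS);
        } else {
            // a planned, expected outcome: no rollback, just route it
            flowFile = session.putAttribute(flowFile, "discard.reason",
                    "invalid data");
            session.transfer(flowFile, REL_FAILURE);
        }
    }

    private boolean isValid(FlowFile flowFile) {
        // hypothetical check; real validation would inspect the flow
        // file's content or attributes
        return flowFile.getAttribute("schema.ok") != null;
    }
}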

Lots of info here, and probably with some gaps.  You're asking the
right questions, so just keep asking.  Lots of folks here want to
help.

[1] http://martinfowler.com/eaaCatalog/unitOfWork.html
[2] https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#session-rollback

Thanks
Joe

On Tue, Mar 15, 2016 at 7:25 PM, Devin Fisher
<devin.fis...@perfectsearchcorp.com> wrote:
> Thanks for your reply. I'm sorry if my question seems confusing. I'm still
> learning how nifi works. I don't have any understanding of how the
> framework works on the back end and only an incomplete understanding of
> the exposed interface. From my point of view (an external process
> developer), asking to roll back the one flow file that failed (I don't
> want changes made to it applied incompletely) and let the other n
> flowfiles move on seems reasonable. But I don't know what is happening in
> the session on the back end.
>
> I likely don't really understand what happens on a rollback. Reading the
> developer's guide I got the impression that rollback disregards all
> changes made in the session, including transfers. It then returns the
> flowfiles to the queue. It would seem that a session is really finished
> and not usable after a rollback. So I then don't understand how I can do
> my use case. I want to roll back (undo changes to a single flow file that
> failed) and then transfer it to the Failed relationship unchanged (or add
> the discard.reason to the attributes).
>
> I assume you mean "Run duration" when you refer to the 'Scheduling' tab. I
> would love to understand better how that works. In the documentation I
> only see a note about it in the User Guide, but the Developer's Guide is
> silent. I don't see how that slider is enforced in the processor code. It
> seems that once the framework has ceded control to the processor it can
> run for as long as it wants. So more information about this would be
> great.
>
> Thanks again for the response. The information is always useful and
> enlightening.
> Devin
>
> On Tue, Mar 15, 2016 at 4:26 PM, Andrew Grande <agra...@hortonworks.com>
> wrote:
>
>> Devin,
>>
>> What you're asking for is a contradictory requirement. One trades
>> individual message transactional control (and the necessary overhead) for
>> higher throughput with micro-batching (but less control). In short, you
>> can't expect to roll back a message and not affect the whole batch.
>>
>> However, if you 'commit' this batch as received by your processor, you
>> could take on the responsibility of storing, tracking, and
>> committing/rolling back those flow files yourself for the downstream
>> connection.... But then, why?
>>
>> In general, one should leverage the NiFi 'Scheduling' tab and have the
>> micro-batching aspect controlled by the framework, unless you really,
>> really have a very good reason to do it yourself.
>>
>> Hope this helps,
>> Andrew
>>
>>
>>
>>
>> On 3/7/16, 5:00 PM, "Devin Fisher" <devin.fis...@perfectsearchcorp.com>
>> wrote:
>>
>> >Question about rollbacks. I have a processor that is grabbing a list of
>> >FlowFiles from session.get(100). It will then process each flow file one
>> >at a time.  I want to be able, if there is an error with a single
>> >FlowFile, to roll it back (and only this failed FlowFile) and transfer
>> >it to the FAILED relationship. But reading the javadoc for
>> >ProcessSession I don't get the sense that I can do that.
>> >
>> >Is my workflow wrong; should I only get one at a time from the session
>> >and commit after each one?
>> >
>> >Devin
>>
