[
https://issues.apache.org/jira/browse/NIFI-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762429#comment-15762429
]
Mark Payne commented on NIFI-3225:
----------------------------------
[[email protected]] - I'd be very hesitant to modify this. The changes
you describe here would have the same effect as the current model if we always
call session.commit(). However, any time that the session is rolled back, it
would behave a bit differently. Namely, we'd roll back more than we intended
because session.checkpoint() hadn't been called. The idea behind the session is
that if we call session.rollback(), it should roll back everything since the
last time session.commit() or session.rollback() was called, but no more.
Changing how frequently checkpoint() is called would change that.
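
To make the rollback boundary concrete, here is a minimal toy model. It is not
NiFi's ProcessSession: ToySession and its methods are hypothetical names, and
checkpoint() is deliberately left out. It only shows the rule that rollback()
discards work done since the last commit() or rollback(), and nothing earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, NOT the NiFi ProcessSession implementation.
class ToySession {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    // Record work done on one item since the last commit/rollback.
    public void process(String flowFile) {
        pending.add(flowFile);
    }

    // Make all pending work permanent and start a new unit of work.
    public void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    // Discard only the work done since the last commit() or rollback().
    public void rollback() {
        pending.clear();
    }

    public List<String> committed() {
        return new ArrayList<>(committed);
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ToySession session = new ToySession();
        session.process("a");
        session.commit();      // "a" is now safe from any later rollback
        session.process("b");
        session.process("c");
        session.rollback();    // discards "b" and "c" only
        session.process("d");
        session.commit();
        System.out.println(session.committed()); // prints [a, d]
    }
}
```
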
I'd also recommend digging into this with a profiler, such as VisualVM and/or
YourKit. Because checkpoint() is going to be called for every flowfile in the
case that you describe, it is reasonable to expect it to show up very
frequently when using a sampler instead of a profiler.
As you said, session.checkpoint() is quite cheap, but not free. I think it's
required for correctness, though. We may be able to improve its efficiency
somewhat if its performance is concerning. The majority of its work is
transferring data from one HashMap to another. We may be able to use some other
type of data structure to be more efficient here, or perhaps tune the HashMap
with non-default values for the constructor arguments.
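
As a rough sketch of the tuning idea, the example below copies entries from one
HashMap into another that is constructed with an explicit initial capacity and
load factor, so the destination does not have to grow while the copy runs. The
names PresizedTransfer and transferAll are hypothetical, and this is not the
actual checkpoint() code; only the HashMap(int, float) constructor is standard
Java. Whether this helps in practice would need to be measured.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not taken from the NiFi code base.
class PresizedTransfer {

    // Moves every entry from 'source' into a new map that is sized up front,
    // then empties 'source'.
    public static <K, V> Map<K, V> transferAll(Map<K, V> source) {
        final float loadFactor = 0.75f; // the HashMap default, stated explicitly
        // Choose a capacity so that source.size() entries stay below
        // capacity * loadFactor, which avoids rehashing during the copy.
        final int initialCapacity = (int) (source.size() / loadFactor) + 1;
        Map<K, V> target = new HashMap<>(initialCapacity, loadFactor);
        target.putAll(source);
        source.clear();
        return target;
    }

    public static void main(String[] args) {
        Map<String, Integer> records = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            records.put("flowfile-" + i, i);
        }
        Map<String, Integer> checkpointed = transferAll(records);
        // prints "1000 moved, 0 left"
        System.out.println(checkpointed.size() + " moved, " + records.size() + " left");
    }
}
```
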
> Abstract Processor type that batches session.get() and session.commit() calls
> -----------------------------------------------------------------------------
>
> Key: NIFI-3225
> URL: https://issues.apache.org/jira/browse/NIFI-3225
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Bryan Rosander
> Assignee: Bryan Rosander
> Priority: Minor
> Attachments: after.png, before.png
>
>
> For processors that are stateless and support batching, it should be safe to
> get and process multiple input FlowFiles for each onTrigger() call.
> This should amortize the cost of session.get(), session.checkpoint(),
> session.commit() as well as any setup in onTrigger() that isn't dependent on
> the FlowFile(s) attributes or content.
> An AbstractBatchingProcessor type should reduce boilerplate code in candidate
> processors and encourage uniform configurability via a property to control
> batch size.