[
https://issues.apache.org/jira/browse/NIFI-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761984#comment-15761984
]
Bryan Rosander commented on NIFI-3225:
--------------------------------------
[~JPercivall]
I wasn't aware of the history of using get(50) before run duration was
introduced. I thought to test this after seeing that some other (apparently
legacy) processors operate that way.
I've been trying to increase throughput through a flow consisting of SplitJson,
EvaluateJsonPath, RouteOnAttribute, UpdateAttribute, and AttributesToJson
processors all with a run duration of 25 ms (much higher and things start to
choke as the queues fill up and swapping begins).
I noticed that about 6% of the AbstractProcessor.onTrigger() time was spent in
session.checkpoint(). After changing the get() call to get(50), the time in
session.checkpoint() dropped to about 2%. It seemed like relatively low-hanging
fruit for roughly 4% higher throughput, and the other points were just a bonus,
although I agree that OnScheduled makes sense for things that don't change.
It seems like the run duration setting and this batch size setting could be
complementary for side-effect-free processors. Even if session.checkpoint() is
cheap, it isn't free, and for processors that don't modify external state the
cost can be reduced just by calling session.commit() less frequently.
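To illustrate the amortization argument, here is a minimal, self-contained
sketch. ToySession and BatchingSketch are hypothetical stand-ins invented for
this example, not the NiFi ProcessSession API; the sketch only counts how many
commits are needed to drain a queue at a given batch size and says nothing
about real checkpoint cost.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy stand-in for a process session (hypothetical, NOT the NiFi API). */
final class ToySession {
    private final Queue<String> input;
    private int commits = 0;

    ToySession(Queue<String> input) {
        this.input = input;
    }

    /** Removes and returns up to maxResults items from the input queue. */
    List<String> get(int maxResults) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxResults && !input.isEmpty()) {
            batch.add(input.poll());
        }
        return batch;
    }

    /** Stands in for the fixed per-commit overhead; here it is only counted. */
    void commit() {
        commits++;
    }

    int commitCount() {
        return commits;
    }
}

final class BatchingSketch {
    /** Drains flowFiles items, committing once per batch; returns the commit count. */
    static int drain(int flowFiles, int batchSize) {
        Queue<String> queue = new ArrayDeque<>();
        for (int i = 0; i < flowFiles; i++) {
            queue.add("ff-" + i);
        }
        ToySession session = new ToySession(queue);
        while (true) {
            List<String> batch = session.get(batchSize);
            if (batch.isEmpty()) {
                break;
            }
            // Per-item work would go here; it is the same at any batch size.
            session.commit();
        }
        return session.commitCount();
    }

    public static void main(String[] args) {
        System.out.println("batch size 1:  " + drain(1000, 1) + " commits");
        System.out.println("batch size 50: " + drain(1000, 50) + " commits");
    }
}
```

The per-item work is unchanged; only the number of fixed-cost commit calls
shrinks as the batch size grows.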
> Abstract Processor type that batches session.get() and session.commit() calls
> -----------------------------------------------------------------------------
>
> Key: NIFI-3225
> URL: https://issues.apache.org/jira/browse/NIFI-3225
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Bryan Rosander
> Assignee: Bryan Rosander
> Priority: Minor
>
> For processors that are stateless and support batching, it should be safe to
> get and process multiple input FlowFiles for each onTrigger() call.
> This should amortize the cost of session.get(), session.checkpoint(),
> session.commit() as well as any setup in onTrigger() that isn't dependent on
> the FlowFile(s) attributes or content.
> An AbstractBatchingProcessor type should reduce boilerplate code in candidate
> processors and encourage uniform configurability via a property to control
> batch size.
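
One possible shape for such a base type is sketched below. Every name here
(BatchingProcessorSketch, setup, process, UpperCaseSketch) is hypothetical and
chosen for illustration; this is not the proposed AbstractBatchingProcessor and
it does not use the NiFi API. It only shows the pattern: item-independent setup
runs once per trigger, and subclasses supply the per-item work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical base type; illustrative only, not from the NiFi codebase. */
abstract class BatchingProcessorSketch<T> {
    private final int batchSize;
    int setupCalls = 0; // exposed only so the sketch can be inspected

    BatchingProcessorSketch(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
    }

    /** Work that does not depend on any single item; runs once per trigger. */
    protected void setup() {
        setupCalls++;
    }

    /** Per-item work supplied by the concrete processor. */
    protected abstract T process(T item);

    /** Handles up to batchSize items from the front of pending in one trigger. */
    final List<T> onTrigger(List<T> pending) {
        int n = Math.min(batchSize, pending.size());
        List<T> out = new ArrayList<>(n);
        if (n == 0) {
            return out; // nothing queued, so skip setup entirely
        }
        setup();
        for (T item : pending.subList(0, n)) {
            out.add(process(item));
        }
        pending.subList(0, n).clear(); // remove the items that were handled
        return out;
    }
}

/** Example subclass: a stateless transformation with no external effects. */
final class UpperCaseSketch extends BatchingProcessorSketch<String> {
    UpperCaseSketch(int batchSize) {
        super(batchSize);
    }

    @Override
    protected String process(String item) {
        return item.toUpperCase(Locale.ROOT);
    }
}
```

A subclass only implements process(); the batch size is a constructor argument
here, standing in for the configurable property mentioned in the description.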