[jira] [Commented] (NIFI-3225) Abstract Processor type that batches session.get() and session.commit() calls

Bryan Rosander (JIRA) Mon, 19 Dec 2016 10:54:25 -0800

    [ 
https://issues.apache.org/jira/browse/NIFI-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761984#comment-15761984
 ]


Bryan Rosander commented on NIFI-3225:
--------------------------------------

[~JPercivall] 

I wasn't aware of the history of using get(50) before run duration was 
introduced.  I thought to test this after seeing that some other 
(apparently legacy) processors operate that way.

I've been trying to increase throughput through a flow consisting of SplitJson, 
EvaluateJsonPath, RouteOnAttribute, UpdateAttribute, and AttributesToJson 
processors, all with a run duration of 25 ms (with anything much higher, things 
start to choke as the queues fill up and swapping begins).

I noticed that about 6% of the AbstractProcessor.onTrigger() time was spent in 
session.checkpoint().  When changing the get() call to get(50), the time in 
session.checkpoint() dropped to about 2%.  It seemed like relatively 
low-hanging fruit for roughly 4% higher throughput, and the other points were 
just a bonus, although I agree that OnScheduled makes sense for things that 
don't change.

It seems like the run duration setting and this batch size setting could be 
complementary for side-effect-free processors.  Even if session.checkpoint() is 
cheap, it isn't free, and for processors that aren't modifying external state, 
the cost can be reduced just by calling session.commit() less frequently.
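
As a rough illustration of the amortization argument, here is a toy model in 
plain Java.  It is not the NiFi API: BatchingSketch, FakeSession, and 
commitsFor are hypothetical names made up for this sketch, and the "session" 
only counts commits.  It assumes one commit per onTrigger() call and work that 
has no external side effects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model (not the NiFi API): counts commits for a given batch size. */
final class BatchingSketch {

    /** Hypothetical stand-in for a session; tracks a queue and a commit count. */
    static final class FakeSession {
        private final Queue<String> queue = new ArrayDeque<>();
        int commits = 0;

        FakeSession(int flowFiles) {
            for (int i = 0; i < flowFiles; i++) {
                queue.add("flowfile-" + i);
            }
        }

        /** Returns up to max queued items, similar in shape to get(50). */
        List<String> get(int max) {
            List<String> batch = new ArrayList<>();
            while (batch.size() < max && !queue.isEmpty()) {
                batch.add(queue.poll());
            }
            return batch;
        }

        void commit() {
            commits++;
        }

        boolean isEmpty() {
            return queue.isEmpty();
        }
    }

    /** One simulated onTrigger(): pull a batch, process it, commit once. */
    static int onTrigger(FakeSession session, int batchSize) {
        List<String> batch = session.get(batchSize);
        if (batch.isEmpty()) {
            return 0; // nothing queued, so nothing to commit
        }
        // Side-effect-free per-FlowFile work would go here.
        session.commit();
        return batch.size();
    }

    /** Drains the queue and reports how many commits that took. */
    static int commitsFor(int flowFiles, int batchSize) {
        FakeSession session = new FakeSession(flowFiles);
        while (!session.isEmpty()) {
            onTrigger(session, batchSize);
        }
        return session.commits;
    }

    public static void main(String[] args) {
        System.out.println("batch size 1:  " + commitsFor(1000, 1) + " commits");
        System.out.println("batch size 50: " + commitsFor(1000, 50) + " commits");
    }
}
```

In this model, draining 1000 queued items takes 1000 commits with a batch size 
of 1 and 20 commits with a batch size of 50.  The per-commit cost in a real 
session is not modeled, so this only shows the call-count reduction, not the 
resulting throughput change.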

> Abstract Processor type that batches session.get() and session.commit() calls
> -----------------------------------------------------------------------------
>
>                 Key: NIFI-3225
>                 URL: https://issues.apache.org/jira/browse/NIFI-3225
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Bryan Rosander
>            Assignee: Bryan Rosander
>            Priority: Minor
>
> For processors that are stateless and support batching, it should be safe to 
> get and process multiple input FlowFiles for each onTrigger() call.  
> This should amortize the cost of session.get(), session.checkpoint(), and 
> session.commit(), as well as any setup in onTrigger() that isn't dependent on 
> the FlowFiles' attributes or content.
> An AbstractBatchingProcessor type should reduce boilerplate code in candidate 
> processors and encourage uniform configurability via a property to control 
> batch size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
