Matthew Clarke created NIFI-4475:
------------------------------------

             Summary: Processors that use session.get(batchsize) will yield if 
multiple inbound connections exist where at least one connection is empty.
                 Key: NIFI-4475
                 URL: https://issues.apache.org/jira/browse/NIFI-4475
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
    Affects Versions: 1.3.0
            Reporter: Matthew Clarke



The NiFi framework handles batches of incoming data (session.get(batchsize)) 
differently from a single FlowFile at a time (session.get()).

For example, PutSyslog works on batches, while PutUDP processes one FlowFile 
at a time.

With the batch method, a thread polls connection 1 and requests a batch of 
FlowFiles.  If it gets at least one FlowFile, it sends the FlowFile(s) it 
received and the thread ends.  The next thread round-robins to the next 
connection (a looped failure relationship, for example) and requests a batch 
again.  If that connection is empty, the framework assumes there is no work to 
do and yields the processor for the configured "yield duration".  So, 
regardless of the run schedule, the processor will not run again until the 
yield duration has elapsed.

With processors that work on only one FlowFile at a time, the thread 
round-robins through all the inbound connections until it finds a FlowFile.  
If it does not find a FlowFile in any connection, the framework yields the 
processor for the configured yield duration.

The intent of the yield duration is to keep processors with the default run 
schedule of 0 sec from using excessive CPU while doing nothing; however, in 
the batch case the processor yields even if FlowFiles exist on another 
connection.  This can have a huge impact on the throughput of processors that 
use session.get(batchsize).
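
The difference between the two polling styles can be sketched with a small 
toy model.  This is not NiFi framework code: the class PollingModel, its 
methods pollBatch and pollOne, and the "ff-N" FlowFile names are all 
hypothetical, and the model only mirrors the behavior described in this 
report (a batch poll looks at one connection per run, a single poll tries 
every connection before giving up).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model of the polling behavior described above; NOT NiFi framework code. */
final class PollingModel {
    private final List<Deque<String>> connections;
    private int next = 0; // round-robin cursor over the inbound connections

    PollingModel(List<Deque<String>> connections) {
        this.connections = connections;
    }

    /** Batch style: look only at the next connection in the rotation. */
    List<String> pollBatch(int batchSize) {
        Deque<String> queue = connections.get(next);
        next = (next + 1) % connections.size();
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !queue.isEmpty()) {
            batch.add(queue.poll());
        }
        return batch; // an empty batch means the processor would yield
    }

    /** Single style: try every connection once before giving up. */
    String pollOne() {
        for (int i = 0; i < connections.size(); i++) {
            Deque<String> queue = connections.get(next);
            next = (next + 1) % connections.size();
            if (!queue.isEmpty()) {
                return queue.poll();
            }
        }
        return null; // null means the processor would yield
    }

    /** Hypothetical setup: one busy connection and one empty failure loop. */
    static List<Deque<String>> sample() {
        Deque<String> busy =
                new ArrayDeque<>(Arrays.asList("ff-1", "ff-2", "ff-3", "ff-4"));
        Deque<String> emptyFailureLoop = new ArrayDeque<>();
        return Arrays.asList(busy, emptyFailureLoop);
    }

    public static void main(String[] args) {
        PollingModel batch = new PollingModel(sample());
        System.out.println(batch.pollBatch(2));           // [ff-1, ff-2]
        // The second run lands on the empty connection, so the processor
        // would yield although ff-3 and ff-4 are still queued.
        System.out.println(batch.pollBatch(2).isEmpty()); // true

        PollingModel single = new PollingModel(sample());
        single.pollOne();                                 // takes ff-1
        // The second run skips the empty connection and still finds work.
        System.out.println(single.pollOne());             // ff-2
    }
}
```

In this model the batch processor completes at most one useful run before 
hitting the empty connection, which is the pattern that would trigger the 
yield described above.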


There are two possible workarounds for this issue:

1. Reduce the configured yield duration. This should improve performance when 
multiple inbound connections exist (where any connection may normally be 
empty). The result is better throughput, but at the expense of more CPU usage 
when all connections are truly empty.

2. Have only one inbound connection to processors that work on batches. This 
can be accomplished by using a funnel.






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
