Matthew Clarke created NIFI-4475:
------------------------------------
Summary: Processors that use session.get(batchsize) will yield if
multiple inbound connections exist where at least one connection is empty.
Key: NIFI-4475
URL: https://issues.apache.org/jira/browse/NIFI-4475
Project: Apache NiFi
Issue Type: Improvement
Components: Core Framework
Affects Versions: 1.3.0
Reporter: Matthew Clarke
There is a difference between how the NiFi framework handles batches of
incoming data (session.get(batchsize)) versus 1 FlowFile (session.get()) at a
time.
For example, PutSyslog works in batches, while PutUDP processes 1 FlowFile at a
time.
With the batch method, a thread polls connection 1 and requests a
batch of FlowFiles. If it gets at least 1 FlowFile, it sends the FlowFile(s)
and the thread ends. The next thread round-robins to the next connection
(a looped failure relationship, for example) and requests a batch again. If that
connection is empty, the framework assumes there is no work to do and yields
the processor for the configured "yield duration". So regardless of run
schedule, the processor will not run again until the yield duration has elapsed.
With processors that only work on 1 FlowFile at a time, the thread will
round-robin all the inbound connections until it finds a FlowFile. Only if it
does not find a FlowFile in any connection will the framework yield the
processor for the configured yield duration.
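The two polling behaviors described above can be sketched as a toy model. This is not NiFi framework code: the class PollingModel, its methods pollBatch and pollOne, and the use of plain string queues as connections are all hypothetical names chosen for illustration. An empty batch (or a null) stands in for the framework yielding the processor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the polling behavior described in this issue (hypothetical, not NiFi code). */
class PollingModel {
    private final List<Deque<String>> connections;
    private int next = 0; // round-robin cursor over the inbound connections

    PollingModel(List<Deque<String>> connections) {
        this.connections = connections;
    }

    /**
     * Batch style: only the next connection in round-robin order is polled.
     * An empty result models a yield, even if other connections hold FlowFiles.
     */
    List<String> pollBatch(int batchSize) {
        Deque<String> queue = connections.get(next);
        next = (next + 1) % connections.size();
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !queue.isEmpty()) {
            batch.add(queue.poll());
        }
        return batch;
    }

    /**
     * Single style: every connection is tried once before giving up.
     * A null result models a yield, which only happens when all are empty.
     */
    String pollOne() {
        for (int i = 0; i < connections.size(); i++) {
            Deque<String> queue = connections.get(next);
            next = (next + 1) % connections.size();
            if (!queue.isEmpty()) {
                return queue.poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Deque<String> busy = new ArrayDeque<>(List.of("a", "b", "c"));
        Deque<String> empty = new ArrayDeque<>();
        PollingModel model = new PollingModel(List.of(busy, empty));

        System.out.println("first batch:  " + model.pollBatch(2));
        // The second poll lands on the empty connection, so the model "yields"
        // although the busy connection still holds a FlowFile.
        System.out.println("second batch: " + model.pollBatch(2));
    }
}
```

In this model the batch poll reports no work on its second call while a FlowFile is still queued on the other connection, whereas pollOne would have skipped the empty connection and found it.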
The intent of yield duration is to keep processors with the default run
schedule of 0 sec from using excessive CPU doing nothing; however, in the case
of batches the processor will yield even if FlowFiles exist on another
connection. This can have a huge impact on the throughput of processors that
use session.get(batchsize).
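To get a feel for the size of that impact, here is a rough back-of-the-envelope estimate. It assumes two inbound connections, one of which is always empty, so that every successful batch is followed by one yield. The class and method names, the batch size of 100, and the timing values are illustrative assumptions, not measurements.

```java
/** Rough throughput bound under the assumptions stated above (hypothetical helper). */
class YieldThroughput {
    /**
     * Upper bound on FlowFiles per second when each batch is followed by one
     * yield: one batch is handled per (work time + yield duration).
     */
    static double maxFlowFilesPerSecond(int batchSize, double yieldSeconds, double batchWorkSeconds) {
        return batchSize / (yieldSeconds + batchWorkSeconds);
    }

    public static void main(String[] args) {
        // Assumed values: batches of 100, 10 ms of work per batch.
        System.out.println(maxFlowFilesPerSecond(100, 1.0, 0.01)); // 1 sec yield
        System.out.println(maxFlowFilesPerSecond(100, 0.1, 0.01)); // 100 ms yield
    }
}
```

Under these assumptions the yield duration dominates the cycle time, so throughput is capped at roughly one batch per yield period no matter how fast the processor itself is.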
There are two possible work-arounds to this issue:
1. Reduce the configured yield duration. You should see improved performance
when multiple inbound connections exist (where any connection may normally be
empty). The result is better throughput, but at the expense of more CPU usage
when all connections are truly empty.
2. Only have one inbound connection to processors that work on batches. This
can be accomplished by using a funnel.