[
https://issues.apache.org/jira/browse/NIFI-7081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026758#comment-17026758
]
Mark Payne commented on NIFI-7081:
----------------------------------
This is basically the same concept as NIFI-6787. However, the commit for
NIFI-6787 had to be rolled back (via NIFI-7076). An explanation of the reason
is given in NIFI-7076. In short, it caused backpressure not to be applied
correctly.
There are two alternate solutions that come to mind.
The first solution would be a rather large undertaking: a "First Available"
Load Balancing Strategy. In this strategy, data would be pulled from nodes as
they are ready to process it, instead of being pushed. Perhaps when a node's
queue falls below some threshold, data could be pulled from other nodes in the
background; this would generally result in lower latency. Alternatively, data
could truly be pulled on demand, but then each node would have to either know
which nodes in the cluster hold data to pull, or constantly ask all nodes
whether data is available.
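The threshold-based variant of this idea could be sketched roughly as follows
(hypothetical class and method names, not NiFi's actual API; a minimal model,
not an implementation proposal):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of a pull-based "First Available" partition: each node
// processes FlowFiles from its local queue and, when the queue depth
// falls below a threshold, asks its peers for more data.
class FirstAvailablePartition {
    private final Queue<String> localQueue = new ArrayDeque<>();
    private final int refillThreshold;

    FirstAvailablePartition(int refillThreshold) {
        this.refillThreshold = refillThreshold;
    }

    void enqueue(String flowFile) {
        localQueue.add(flowFile);
    }

    // Poll a FlowFile for local processing; kick off a peer refill
    // request once the queue depth drops below the threshold.
    String poll(Runnable requestFromPeers) {
        if (localQueue.size() < refillThreshold) {
            requestFromPeers.run(); // would run asynchronously in practice
        }
        return localQueue.poll();
    }

    int depth() {
        return localQueue.size();
    }
}
```

Refilling before the queue is fully empty is what keeps latency low: a node
never sits idle waiting for a synchronous fetch to complete.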
Another approach, which would be less efficient but may still provide better
behavior than the current implementation, would be to periodically (or in
response to some event) check whether the partitions are "lopsided" or fairly
evenly distributed. If lopsided, the data in the larger partitions would be
rebalanced across all partitions.
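The lopsidedness check could look something like the following (hypothetical
names and a hypothetical 3x-of-average heuristic, shown only to illustrate the
idea; the real threshold and trigger would need design work):

```java
import java.util.Arrays;

// Sketch of a periodic rebalance check: a queue is considered
// "lopsided" when any partition holds far more FlowFiles than the
// average partition; rebalancing spreads the total evenly.
class PartitionRebalancer {

    // True if any partition's depth exceeds the average depth by the
    // given factor (e.g. 3.0 for "3x the average").
    static boolean isLopsided(int[] depths, double factor) {
        double avg = Arrays.stream(depths).average().orElse(0);
        return Arrays.stream(depths).anyMatch(d -> d > avg * factor);
    }

    // Redistribute the total FlowFile count evenly across partitions,
    // handing any remainder to the first partitions.
    static int[] rebalance(int[] depths) {
        int total = Arrays.stream(depths).sum();
        int n = depths.length;
        int[] result = new int[n];
        Arrays.fill(result, total / n);
        for (int i = 0; i < total % n; i++) {
            result[i]++;
        }
        return result;
    }
}
```

The cost is that rebalancing moves data between nodes a second time, which is
why this approach is less efficient than pulling data only where it is needed.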
> Improve handling of Load Balanced Connections when one node is slow
> -------------------------------------------------------------------
>
> Key: NIFI-7081
> URL: https://issues.apache.org/jira/browse/NIFI-7081
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Mark Payne
> Priority: Major
>
> When a connection is configured to use Round Robin load balancing, the
> FlowFile Queue works by queuing up one FlowFile to be processed locally, one
> to be sent to Node 2, one to be sent to Node 3, the next one to be locally
> processed, etc. (in this case, assuming a 3-node cluster).
> If one node in a cluster is slow, though, we can have a situation where the
> local partition is empty and the partition for Node 2 is empty. But Node 3's
> partition is full, because Node 3 is not processing the data quickly enough.
> As a result, on Node 1, the queue ends up applying backpressure, with all
> FlowFiles in the queue waiting to be pushed to Node 3.
> In such a situation, we end up preventing any data from being processed by
> Node 1 or Node 2. It would be advantageous to improve this so that Node 1 and
> Node 2 could still be busy processing data.
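The scenario quoted above can be modeled in a few lines (hypothetical names; a
minimal model of round-robin dealing with a queue-level object threshold, not
NiFi's actual queue implementation):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Round-robin dealing into per-node partitions: if one node stops
// draining its partition, all queued FlowFiles pile up there and the
// queue-level backpressure threshold is reached, even though the other
// partitions are empty.
class RoundRobinQueue {
    private final Queue<String>[] partitions;
    private final int objectThreshold;
    private int next = 0;

    @SuppressWarnings("unchecked")
    RoundRobinQueue(int nodes, int objectThreshold) {
        this.partitions = new Queue[nodes];
        for (int i = 0; i < nodes; i++) {
            partitions[i] = new ArrayDeque<>();
        }
        this.objectThreshold = objectThreshold;
    }

    // Deal the next FlowFile to the next partition in turn.
    void offer(String flowFile) {
        partitions[next].add(flowFile);
        next = (next + 1) % partitions.length;
    }

    // Simulate one node processing one FlowFile from its partition.
    String drain(int node) {
        return partitions[node].poll();
    }

    // Backpressure is based on the total queue depth, so one hoarding
    // partition is enough to trip it.
    boolean isBackpressureApplied() {
        int total = 0;
        for (Queue<String> p : partitions) {
            total += p.size();
        }
        return total >= objectThreshold;
    }

    int depthOf(int node) {
        return partitions[node].size();
    }
}
```

With two fast nodes and one slow node, every FlowFile remaining in the queue
ends up in the slow node's partition, and backpressure then stalls the fast
nodes as well.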
--
This message was sent by Atlassian Jira
(v8.3.4#803005)