[
https://issues.apache.org/jira/browse/NIFI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995873#comment-16995873
]
Mark Payne commented on NIFI-6787:
----------------------------------
[~tpalfy] thanks, now I get it! Sorry it took so long for me to get back to
this. The template really helped, though. I was able to very easily see the
problem that you see and understand your point. I was then able to apply your
patch and verify that it made a huge difference! Awesome!
The only thing that then gave me a bit of pause was that it's called "Round
Robin" but in reality it's not entirely Round Robin now, but more of a "Next
Available" algorithm, because the load is not necessarily spread quite as
evenly across the cluster. I think that's perfectly okay, though. I will merge
this change but will add a small note about this caveat to the tooltip in the
UI, to the javadoc, and to the User Guide.
I saw a huge improvement in the performance though. Many thanks!
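The "Round Robin" vs. "Next Available" distinction can be sketched roughly as follows. This is a hypothetical illustration only, not NiFi's actual partitioner code; the class and method names (PartitionerSketch, roundRobin, nextAvailable, canAccept) are made up for the example:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: strict round robin vs. a "next available" selection.
public class PartitionerSketch {
    private int index = 0;

    // Strict round robin: always hand out the next node in order, even if
    // that node's queue is full. One slow node can therefore stall everyone.
    <T> T roundRobin(List<T> nodes) {
        T chosen = nodes.get(index);
        index = (index + 1) % nodes.size();
        return chosen;
    }

    // "Next available": skip nodes that cannot accept work right now. A slow
    // node no longer stalls distribution, but as Mark notes, the load is not
    // necessarily spread quite as evenly across the cluster.
    <T> T nextAvailable(List<T> nodes, Predicate<T> canAccept) {
        for (int i = 0; i < nodes.size(); i++) {
            T candidate = nodes.get(index);
            index = (index + 1) % nodes.size();
            if (canAccept.test(candidate)) {
                return candidate;
            }
        }
        return null; // no node can accept right now; the caller should yield
    }
}
```

With three nodes a/b/c where a is full, roundRobin still returns a on its turn, while nextAvailable skips to b.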
> Load balanced connection can hold up for whole cluster when one node slows
> down
> -------------------------------------------------------------------------------
>
> Key: NIFI-6787
> URL: https://issues.apache.org/jira/browse/NIFI-6787
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Tamas Palfy
> Assignee: Tamas Palfy
> Priority: Major
> Attachments: template.change_hostname_in_groovy_script.xml
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A slow processor on one node in a cluster can slow down the same processor on
> the other nodes when load balanced incoming connection is used:
> When a processor decides whether it can pass along a flowfile, it checks
> whether the outgoing connection is full or not (if full, it yields).
> Similarly when the receiving processor is to be scheduled the framework
> checks if the incoming connection is empty (if empty, no reason to call
> {{onTrigger}}).
> The problem is that when checking if it's full or not, it checks the _total_
> size (which is across the whole cluster) and compares it to the max (which is
> scoped to the current node).
> The empty check is (correctly) done on the local partition only.
> This can lead to the case where the slow node fills up its queue while the
> faster ones empty theirs.
> Once the slow node has a full queue, the fast ones stop receiving new input
> and thus stop working after their queues get emptied.
> The issue is probably the fact that
> {{SocketLoadBalancedFlowFileQueue.isFull}} actually calls to
> {{AbstractFlowFileQueue.isFull}} which checks {{size()}}, but that returns
> the _total_ size. (The empty check looks fine but for reference it is done
> via {{SocketLoadBalancedFlowFileQueue.isActiveQueueEmpty}}).
> {code:title=AbstractFlowFileQueue.java|borderStyle=solid}
> ...
>     private MaxQueueSize getMaxQueueSize() {
>         return maxQueueSize.get();
>     }
>
>     @Override
>     public boolean isFull() {
>         return isFull(size());
>     }
>
>     protected boolean isFull(final QueueSize queueSize) {
>         final MaxQueueSize maxSize = getMaxQueueSize();
>
>         // Check if max size is set
>         if (maxSize.getMaxBytes() <= 0 && maxSize.getMaxCount() <= 0) {
>             return false;
>         }
>
>         if (maxSize.getMaxCount() > 0 && queueSize.getObjectCount() >= maxSize.getMaxCount()) {
>             return true;
>         }
>
>         if (maxSize.getMaxBytes() > 0 && queueSize.getByteCount() >= maxSize.getMaxBytes()) {
>             return true;
>         }
>
>         return false;
>     }
> ...
> {code}
> {code:title=SocketLoadBalancedFlowFileQueue.java|borderStyle=solid}
> ...
>     @Override
>     public QueueSize size() {
>         return totalSize.get();
>     }
> ...
>     @Override
>     public boolean isActiveQueueEmpty() {
>         return localPartition.isActiveQueueEmpty();
>     }
> ...
> {code}
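The asymmetry quoted above (full check against the cluster-wide total, empty check against the local partition) can be illustrated with a minimal standalone sketch. This is not NiFi code; the class, method names, and the counts are hypothetical, chosen only to show how a full queue on one slow node makes the fast, empty nodes report "full" too:

```java
// Minimal sketch of the NIFI-6787 behavior: comparing the cluster-wide total
// against a per-node back-pressure threshold starves the fast nodes.
public class FullCheckSketch {
    static final int MAX_COUNT_PER_NODE = 10; // back-pressure threshold, per node

    // Buggy check: uses the cluster-wide total size,
    // as in AbstractFlowFileQueue.isFull(size()).
    static boolean isFullClusterWide(int totalClusterCount) {
        return totalClusterCount >= MAX_COUNT_PER_NODE;
    }

    // Local check: uses only the local partition's size.
    static boolean isFullLocal(int localCount) {
        return localCount >= MAX_COUNT_PER_NODE;
    }

    public static void main(String[] args) {
        // The slow node's partition is full; two fast nodes are empty.
        int slowNode = 10, fastNodeA = 0, fastNodeB = 0;
        int total = slowNode + fastNodeA + fastNodeB; // 10 across the cluster

        // Buggy: every node's queue reports full, so upstream stops sending
        // work even to the nodes whose local partitions are empty.
        System.out.println(isFullClusterWide(total)); // true on every node

        // Local: only the slow node back-pressures; fast nodes keep receiving.
        System.out.println(isFullLocal(fastNodeA)); // false
        System.out.println(isFullLocal(slowNode));  // true
    }
}
```

Once the fast nodes also drain their local queues (the empty check is correctly local), they sit idle until the slow node catches up, which matches the behavior described in the report.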
--
This message was sent by Atlassian Jira
(v8.3.4#803005)