Dan W created NIFI-14150:
----------------------------

             Summary: Site to Site Remote Process Group component not cycling through listed uri's
                 Key: NIFI-14150
                 URL: https://issues.apache.org/jira/browse/NIFI-14150
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Extensions
    Affects Versions: 2.0.0
         Environment: Windows Server 2022 Standard - 10.0.20348 Build 20348
jdk-21.0.5+11
            Reporter: Dan W


The Remote Process Group component has a text input for URIs, which accepts a 
comma-separated list.

In my case there are 4 possible destination clusters, each running 5 nodes. 
Two clusters are at Site 1, labeled Cluster A and Cluster B, and the other two 
are at Site 2, as follows:
{code:java}
S1A1, S1A2, S1A3, S1A4, S1A5
S1B1, S1B2, S1B3, S1B4, S1B5
S2A1, S2A2, S2A3, S2A4, S2A5
S2B1, S2B2, S2B3, S2B4, S2B5
{code}
I added them to the URI list in that order:
{code:java}
https://S1A1:8443/nifi,https://S1A2:8443/nifi...https://S2B5:8443/nifi
{code}
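For illustration, here is a minimal Java sketch of how a comma-separated list like this could be split and checked against the rule in the field's help text that different protocols cannot be mixed. The class and method names (UriListParser, parse) are hypothetical and are not taken from the NiFi code base. The sketch only shows the parsing idea, not how NiFi actually handles the field.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache NiFi.
final class UriListParser {

    // Splits a comma-separated URI list and rejects mixed protocols.
    static List<URI> parse(String commaSeparated) {
        List<URI> uris = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            URI uri = URI.create(trimmed);
            if (uri.getScheme() == null) {
                throw new IllegalArgumentException("Missing protocol: " + trimmed);
            }
            // Every entry must use the same protocol as the first entry.
            if (!uris.isEmpty()
                    && !uris.get(0).getScheme().equalsIgnoreCase(uri.getScheme())) {
                throw new IllegalArgumentException("Mixed protocols: " + trimmed);
            }
            uris.add(uri);
        }
        return uris;
    }
}
```

Calling UriListParser.parse with two https entries returns two URI objects in list order, so any "first entry only" behavior would correspond to index 0 of that list.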
I then stopped the input port on S1A1 and watched the error log:
{code:java}
org.apache.nifi.processor.exception.ProcessException: org.apache.nifi.remote.exception.PortNotRunningException: Peer[url=nifi://S1A2:10443,CLOSED] indicates that port 3819c3ca-018e-1000-5225-ace7c2fb62d5 is not running
        at org.apache.nifi.remote.StandardRemoteGroupPort.onTrigger(StandardRemoteGroupPort.java:231)
        at org.apache.nifi.controller.AbstractPort.onTrigger(AbstractPort.java:260)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
        at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.nifi.remote.exception.PortNotRunningException: Peer[url=nifi://S1A2:10443,CLOSED] indicates that port 3819c3ca-018e-1000-5225-ace7c2fb62d5 is not running
        at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:255)
        at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:127)
        at org.apache.nifi.remote.StandardRemoteGroupPort.onTrigger(StandardRemoteGroupPort.java:224)
        ... 10 common frames omitted
{code}
Those errors continued to cycle through nodes S1A1 - S1A5, but the sender never 
attempted any node in S1B or at Site 2.

I then switched the URIs list to:
{code:java}
https://S1A1:8443/nifi,https://S1B1:8443/nifi,https://S2A1:8443/nifi,https://S2B1:8443/nifi
{code}
The same errors continued to cycle, only ever across cluster S1A, nodes 1 - 5.

I concluded that only the first entry in the URI list for the Remote Process 
Group is ever being attempted for the purposes of sending a flow file.

Because the tests so far only involved disabling the input port on the 
receiving cluster, I tested further by deliberately changing the port in the 
first entry of the URI list to one that is known to be unbound on the 
receiving cluster.

In that case, after several attempts, there was an automatic failover to S1B1, 
which is a partially desirable outcome.

There are 2 points to raise on this topic:

A) The description for the uri list is as follows:

"Specify the remote target NiFi URLs. Multiple URLs can be specified in 
comma-separated format. Different protocols cannot be mixed. If remote NiFi is 
a cluster, two or more node URLs are recommended for better connection 
establishment availability."

In particular, the sentence "If remote NiFi is a cluster, two or more node URLs 
are recommended for better connection establishment availability." appears to 
be false.

The only behavior I have observed so far is that the sender reaches out to the 
first item on the list, and that cluster's coordinator responds with its most 
available node to send to over port 10443 (or whatever port is configured for 
site to site in nifi.properties).  Adding further nodes from the same cluster 
to the list therefore gives no greater availability, since any node appears to 
respond with whatever the coordinator relays.

B) Load balancing - In our case, this is the more important point.  The current 
behavior may be deliberate: a responsive NiFi instance is treated as sufficient, 
even when it tells the sending node that it is online but the destination input 
port is disabled, and the sending host waits until the input port becomes 
available.  If so, I suggest a property that changes this behavior.  For 
example, while we are working on a cluster, we may want traffic to fail over to 
the next cluster on the list.
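A minimal Java sketch of what such a property might control, under stated assumptions: the FailoverPolicy class, the ProbeResult values and the failoverOnPortNotRunning flag are all hypothetical names and do not exist in NiFi. The default branch mirrors the behavior observed above, where an unreachable port eventually led to failover but a stopped input port did not.

```java
// Hypothetical policy object for illustration only; not part of Apache NiFi.
final class FailoverPolicy {

    // Simplified outcomes of trying to open a transaction against one URI.
    enum ProbeResult { OK, PORT_NOT_RUNNING, UNREACHABLE }

    // Hypothetical property: treat a stopped input port like an outage.
    private final boolean failoverOnPortNotRunning;

    FailoverPolicy(boolean failoverOnPortNotRunning) {
        this.failoverOnPortNotRunning = failoverOnPortNotRunning;
    }

    // Decides whether the sender should move on to the next URI in the list.
    boolean shouldTryNextUri(ProbeResult result) {
        switch (result) {
            case UNREACHABLE:
                return true; // matches the failover seen with an unbound port
            case PORT_NOT_RUNNING:
                return failoverOnPortNotRunning; // the requested new behavior
            default:
                return false; // target is healthy, keep sending to it
        }
    }
}
```

With the flag set to false the sketch keeps retrying the same cluster while its input port is stopped, which is what this report describes. With the flag set to true it would move to the next entry instead.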

It would also be ideal to add all 4 of our receiving clusters to the URI list 
and have the sender cycle through them.  If S2A1 is up AND S1B1 is up, send to 
both.  This could be as simple as round-robin, or it could be proper load 
balancing based on metrics reported back from the receiving clusters.  If 
Site 2 goes down, continue balancing between S1B1 and S1A1.
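The simplest variant of this request, round-robin over whichever clusters are currently available, could be sketched in Java as follows. This is only an illustration of the requested behavior: ClusterRoundRobin and its availability check are hypothetical and are not taken from NiFi, and how a cluster is judged "up" is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical round-robin selector for illustration only; not part of Apache NiFi.
final class ClusterRoundRobin {

    private final List<String> clusterUris;
    private int next = 0; // index of the entry to try first on the next call

    ClusterRoundRobin(List<String> clusterUris) {
        this.clusterUris = new ArrayList<>(clusterUris);
    }

    // Returns the next available URI in list order, skipping unavailable
    // entries, or empty if no entry passes the availability check.
    synchronized Optional<String> select(Predicate<String> isAvailable) {
        int size = clusterUris.size();
        for (int i = 0; i < size; i++) {
            int index = (next + i) % size;
            String candidate = clusterUris.get(index);
            if (isAvailable.test(candidate)) {
                next = (index + 1) % size; // resume after the chosen entry
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Given the four-entry list used earlier and a check that only accepts Site 1, successive calls alternate between the S1A1 and S1B1 entries, which is the "continue balancing" case described above.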



--
This message was sent by Atlassian Jira
(v8.20.10#820010)