Dan W created NIFI-14150:
----------------------------
Summary: Site to Site Remote Process Group component not cycling
through listed URIs
Key: NIFI-14150
URL: https://issues.apache.org/jira/browse/NIFI-14150
Project: Apache NiFi
Issue Type: Improvement
Components: Extensions
Affects Versions: 2.0.0
Environment: Windows Server 2022 Standard - 10.0.20348 Build 20348
jdk-21.0.5+11
Reporter: Dan W
The Remote Process Group component has a text input for URIs, which may be
comma separated.
In my case there are 4 possible destination clusters, each running 5 nodes.
Two clusters are at Site 1, marked Cluster A and B, and the other two are at
Site 2, as follows:
{code:java}
S1A1, S1A2, S1A3, S1A4, S1A5
S1B1, S1B2, S1B3, S1B4, S1B5
S2A1, S2A2, S2A3, S2A4, S2A5
S2B1, S2B2, S2B3, S2B4, S2B5
{code}
I added them to the URI list accordingly:
{code:java}
https://S1A1:8443/nifi,https://S1A2:8443/nifi...https://S2B5:8443/nifi
{code}
I then shut down the Input processor on S1A1 and watched the error log:
{code:java}
org.apache.nifi.processor.exception.ProcessException:
org.apache.nifi.remote.exception.PortNotRunningException:
Peer[url=nifi://S1A2:10443,CLOSED] indicates that port
3819c3ca-018e-1000-5225-ace7c2fb62d5 is not running
at
org.apache.nifi.remote.StandardRemoteGroupPort.onTrigger(StandardRemoteGroupPort.java:231)
at
org.apache.nifi.controller.AbstractPort.onTrigger(AbstractPort.java:260)
at
org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
at
org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at
java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
at
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.nifi.remote.exception.PortNotRunningException:
Peer[url=nifi://S1A2:10443,CLOSED] indicates that port
3819c3ca-018e-1000-5225-ace7c2fb62d5 is not running
at
org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:255)
at
org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:127)
at
org.apache.nifi.remote.StandardRemoteGroupPort.onTrigger(StandardRemoteGroupPort.java:224)
... 10 common frames omitted {code}
Those errors continued to cycle through nodes S1A1 - S1A5, but the client
never attempted the S1B or S2 entries.
I then switched the URI list to:
{code:java}
https://S1A1:8443/nifi,https://S1B1:8443/nifi,https://S2A1:8443/nifi,https://S2B1:8443/nifi{code}
The same errors continued to cycle, only ever across Site 1 Cluster A nodes 1 - 5.
I concluded that only the first entry in the Remote Process Group's URI list
is ever attempted for the purpose of sending a FlowFile.
Since the tests so far only covered disabling the input port on the receiving
cluster, I tested further by deliberately falsifying the destination port in
the first URI entry, switching it to a port known to be unbound on the
receiving cluster.
In that case, after a few attempts, there was an automatic failover to S1B1 - a
partially desirable outcome.
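The two tests suggest the failover decision depends on the kind of failure. As a hedged sketch (not NiFi's actual client code - `advancesToNextUri` and the stand-in `PortNotRunningException` class are hypothetical illustrations of the observed behavior): a transport-level failure such as an unbound port advances to the next URI, while an application-level "port not running" response keeps retrying the same cluster.

```java
import java.net.ConnectException;

// Hypothetical stand-in for org.apache.nifi.remote.exception.PortNotRunningException
class PortNotRunningException extends Exception { }

class FailoverDecision {
    // Illustration of the observed behavior: only a transport-level failure
    // (e.g. an unbound port producing ConnectException) moves on to the next
    // URI; an application-level "port not running" reply retries the same cluster.
    static boolean advancesToNextUri(Exception e) {
        return e instanceof ConnectException;
    }

    public static void main(String[] args) {
        System.out.println(advancesToNextUri(new ConnectException("refused"))); // falsified port: fails over
        System.out.println(advancesToNextUri(new PortNotRunningException()));   // stopped port: retries same cluster
    }
}
```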
There are 2 points to raise on this topic:
A) The description for the URI list is as follows:
"Specify the remote target NiFi URLs. Multiple URLs can be specified in
comma-separated format. Different protocols cannot be mixed. If remote NiFi is
a cluster, two or more node URLs are recommended for better connection
establishment availability."
Highlighting, "If remote NiFi is a cluster, two or more node URLs are
recommended for better connection establishment availability." - this would
appear to be a false statement.
The only behavior I've observed so far is that the client reaches out to the
first item on the list, and that cluster's coordinator responds with its most
available node to send to over port 10443 (or whatever is configured for site
to site within nifi.properties). Thus there is no greater availability from
adding additional nodes of that same cluster to the list, since any node
responds with whatever the coordinator relays.
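The point above can be sketched as a simulation. This is illustrative only, not NiFi's actual connection logic; `firstReachable` and the `responding` set are hypothetical stand-ins for the real connection attempt:

```java
import java.util.List;
import java.util.Set;

class BootstrapSelection {
    // Hypothetical model of the observed behavior: the client walks the
    // configured URI list and stops at the first instance that answers at all,
    // regardless of whether the destination input port is running.
    static String firstReachable(List<String> uris, Set<String> responding) {
        for (String uri : uris) {
            if (responding.contains(uri)) {
                return uri; // later entries are never consulted
            }
        }
        throw new IllegalStateException("no URI reachable");
    }

    public static void main(String[] args) {
        List<String> uris = List.of(
            "https://S1A1:8443/nifi", "https://S1B1:8443/nifi",
            "https://S2A1:8443/nifi", "https://S2B1:8443/nifi");
        // All four clusters respond, so only the first entry is ever used;
        // its coordinator supplies the actual peer list from then on.
        System.out.println(firstReachable(uris, Set.of(
            "https://S1A1:8443/nifi", "https://S1B1:8443/nifi",
            "https://S2A1:8443/nifi", "https://S2B1:8443/nifi")));
    }
}
```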
B) Load balancing - In our case, this is the more important concept. If the
behavior is deliberate, such that a responsive NiFi instance (even one that
tells the sending node it is online but has the destination input port
disabled) is considered sufficient and the sending host is expected to wait
until the input port becomes available, then I suggest a property that amends
this behavior. In our case, if we are working on the cluster for some reason,
we may want traffic to fail over to the next cluster on the list while that is
the case.
It would also be ideal to add all 4 of our receiving clusters to the URI list
and have them cycled through. If S2A1 is up AND S1B1 is up, send to both (this
doesn't have to be round-robin; it could be based on various metrics reported
back from the receiving cluster to implement proper load balancing, but it
could also be as simple as round-robin). If Site 2 goes down, then continue
balancing between S1B1 and S1A1.
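The simple round-robin variant of the suggestion could look like the following sketch. It is a hypothetical illustration of the requested behavior, not existing NiFi functionality; `ClusterRoundRobin` and the `up` set are invented names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ClusterRoundRobin {
    private final List<String> clusters;
    private int next = 0;

    ClusterRoundRobin(List<String> clusters) {
        this.clusters = new ArrayList<>(clusters);
    }

    // Return the next cluster whose input port is accepting data,
    // skipping entries that are currently down.
    String nextAvailable(Set<String> up) {
        for (int i = 0; i < clusters.size(); i++) {
            String candidate = clusters.get(next);
            next = (next + 1) % clusters.size();
            if (up.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no destination cluster available");
    }

    public static void main(String[] args) {
        ClusterRoundRobin rr = new ClusterRoundRobin(
            List.of("S1A", "S1B", "S2A", "S2B"));
        Set<String> up = Set.of("S1A", "S1B"); // Site 2 is down
        // Sending alternates between the two remaining clusters.
        for (int i = 0; i < 4; i++) {
            System.out.print(rr.nextAvailable(up) + " ");
        }
    }
}
```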
--
This message was sent by Atlassian Jira
(v8.20.10#820010)