[jira] [Comment Edited] (NIFI-13885) NiFi revision manager and heartbeat validation conflicts leading to reconnection request

Ravikumar M (Jira) Mon, 21 Apr 2025 03:16:11 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-13885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946087#comment-17946087
 ]


Ravikumar M edited comment on NIFI-13885 at 4/21/25 10:15 AM:
--------------------------------------------------------------

We got the same issue in 2.0.0-M2, this version is already using [JetBrains 
Xodus|https://github.com/JetBrains/xodus] , we didn't see any delay or failure 
for API request processing in the cluster here. But we are still hitting with 
this problem twice now where it impacted us. Is there any solution for this?
{code:java}
2025-04-15 13:45:42,352 INFO [Clustering Tasks Thread-3] 
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 
2025-04-15T12:45:42.259Z and sent to 
nifi-1.nifi-hs.pipercx.svc.cluster.local:1025 at 
2025-04-15T12:45:42.352338719Z; determining Cluster Coordinator took 4 millis; 
DNS lookup for coordinator took 0 millis; connecting to coordinator took 0 
millis; sending heartbeat took 45 millis; receiving first byte from response 
took 43 millis; receiving full response took 44 millis; total time was 93 
millis 2025-04-15 13:45:44,495 WARN [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Requesting that 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 reconnect to the cluster due to: 
Node has a Revision Update Count of 213336 but local value is only 213335. Node 
appears not to have the appropriate set of Component Revisions 2025-04-15 
13:45:44,495 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 -- Requesting that node connect 
to cluster 2025-04-15 13:45:44,506 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Status of 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 changed from 
NodeConnectionStatus[nodeId=nifi-0.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTED, updateId=4347] to 
NodeConnectionStatus[nodeId=nifi-0.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTING, updateId=4351] 2025-04-15 13:45:44,565 WARN [Heartbeat 
Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Requesting that 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 reconnect to the cluster due to: 
Node has a Revision Update Count of 213336 but local value is only 213335. Node 
appears not to have the appropriate set of Component Revisions 2025-04-15 
13:45:44,565 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 -- Requesting that node connect 
to cluster 2025-04-15 13:45:44,571 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Status of 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 changed from 
NodeConnectionStatus[nodeId=nifi-2.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTED, updateId=4350] to 
NodeConnectionStatus[nodeId=nifi-2.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTING, updateId=4352] 2025-04-15 13:45:44,650 INFO [Heartbeat 
Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 3 
heartbeats in 155 millis{code}
 


was (Author: JIRAUSER307668):
We got the same issue in 2.0.0-M2, Is there any solution for this?


{code:java}
2025-04-15 13:45:42,352 INFO [Clustering Tasks Thread-3] 
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 
2025-04-15T12:45:42.259Z and sent to 
nifi-1.nifi-hs.pipercx.svc.cluster.local:1025 at 
2025-04-15T12:45:42.352338719Z; determining Cluster Coordinator took 4 millis; 
DNS lookup for coordinator took 0 millis; connecting to coordinator took 0 
millis; sending heartbeat took 45 millis; receiving first byte from response 
took 43 millis; receiving full response took 44 millis; total time was 93 
millis 2025-04-15 13:45:44,495 WARN [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Requesting that 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 reconnect to the cluster due to: 
Node has a Revision Update Count of 213336 but local value is only 213335. Node 
appears not to have the appropriate set of Component Revisions 2025-04-15 
13:45:44,495 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 -- Requesting that node connect 
to cluster 2025-04-15 13:45:44,506 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Status of 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 changed from 
NodeConnectionStatus[nodeId=nifi-0.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTED, updateId=4347] to 
NodeConnectionStatus[nodeId=nifi-0.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTING, updateId=4351] 2025-04-15 13:45:44,565 WARN [Heartbeat 
Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Requesting that 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 reconnect to the cluster due to: 
Node has a Revision Update Count of 213336 but local value is only 213335. Node 
appears not to have the appropriate set of Component Revisions 2025-04-15 
13:45:44,565 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 -- Requesting that node connect 
to cluster 2025-04-15 13:45:44,571 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Status of 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 changed from 
NodeConnectionStatus[nodeId=nifi-2.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTED, updateId=4350] to 
NodeConnectionStatus[nodeId=nifi-2.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTING, updateId=4352] 2025-04-15 13:45:44,650 INFO [Heartbeat 
Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 3 
heartbeats in 155 millis{code}
 

> NiFi revision manager and heartbeat validation conflicts leading to 
> reconnection request
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-13885
>                 URL: https://issues.apache.org/jira/browse/NIFI-13885
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.23.0, 1.24.0, 1.23.2
>            Reporter: Shubham Sharma
>            Priority: Critical
>         Attachments: hkl25073070_jstack.tar.gz, hkl25073177_jstack.tar.gz
>
>
> With a 5 node NiFi cluster deployment we observed a critical issue where 
> reconnection requests are getting generated frequently due to difference in 
> local coordinator's and given node's revisionUpdateCounter, affecting 
> workflow lifecycle.
> *User Log summary*
> *User request to update controller service on random node(test01)*
> {code:java}
> 2024-10-14 18:44:55,927 INFO [NiFi Web Server-151009] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 PUT 
> https://test01.dvl:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>  
> *User request forwarded immediately to coordinator(test02)*
>  
> {code:java}
>   2024-10-14 18:44:55,929 INFO [NiFi Web Server-1414436] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 PUT 
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598
>     {code}
>  
> *Delay to complete request as shown here in coordinator (test02)*
> {code:java}
> 2024-10-14 18:46:17,952 INFO [NiFi Web Server-1414436] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 GET 
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>  
> *App Log analysis*
> Added a log statement at the end of updateRevision function in 
> NaiveRevisionManager for printing updated value of revisionUpdateCounter.
>  
> *Coordinator app log summary*
> On main node test01 it took around 48 seconds to process above request for 
> updating disabled controller service via restAPI and update 
> revisionUpdateCounter to *3081375*
>  
> *2024-10-14 18:45:43,455* INFO [NiFi Web Server-151009] 
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter 
> *3081375*
>  
> *Coordinator (test02)received its own heartbeat*
>  
> *2024-10-14 18:45:47,143* INFO [Process Cluster Protocol Request-62] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> test02:9443, revisionId {*}3081374{*}, activethreadCount 79 , 
> connectionStatus CONNECTED -----
>  
>  *Received test01 node's heartbeat request to coordinator BUT with updated 
> revisionID*
> *2024-10-14 18:45:57,724* INFO [Process Cluster Protocol Request-91] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> test01:9443, revisionId {*}3081375{*}, activethreadCount 73 , 
> connectionStatus CONNECTED -----
>  
> *Scheduled Heartbeat validation gets triggered but due to above mismatch it 
> requested reconnection for test01*
> *2024-10-14 18:46:07,654* WARN [Heartbeat Monitor Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator ----- Requesting that {*}test01{*}:9443 
> reconnect to the cluster due to: Node has a Revision Update Count of 
> *3081375* but local value is only {*}3081374{*}. Node appears not to have the 
> appropriate set of Component Revisions -----
>  
> *During reconnection request on test01 node, ie. going under reconnection and 
> synchronisation process, finally coordinator received updated revision 
> counter*
> *2024-10-14 18:46:11,238* INFO [NiFi Web Server-1414436] 
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter 
> *3081375*
>  
> *And next heartbeat* *sent by coordinator (test02) to itself* *containing 
> updated revision counter* 
> *2024-10-14 18:46:17,213* INFO [Process Cluster Protocol Request-2] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> {*}test02{*}:9443, revisionId {*}3081375{*}, activethreadCount 111 , 
> connectionStatus CONNECTED -----
>  
> This reconnection attempts is happening frequently and affecting workflow 
> request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (NIFI-13885) NiFi revision manager and heartbeat validation conflicts leading to reconnection request

Reply via email to