[jira] [Commented] (NIFI-13885) NiFi revision manager and heartbeat validation conflicts leading to reconnection request

Ravikumar M (Jira) Mon, 21 Apr 2025 02:54:04 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-13885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946087#comment-17946087
 ]


Ravikumar M commented on NIFI-13885:
------------------------------------

We got the same issue in 2.0.0-M2, Is there any solution for this?
{{}}
{code:java}
2025-04-15 13:45:42,352 INFO [Clustering Tasks Thread-3] 
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 
2025-04-15T12:45:42.259Z and sent to 
nifi-1.nifi-hs.pipercx.svc.cluster.local:1025 at 
2025-04-15T12:45:42.352338719Z; determining Cluster Coordinator took 4 millis; 
DNS lookup for coordinator took 0 millis; connecting to coordinator took 0 
millis; sending heartbeat took 45 millis; receiving first byte from response 
took 43 millis; receiving full response took 44 millis; total time was 93 
millis 2025-04-15 13:45:44,495 WARN [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Requesting that 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 reconnect to the cluster due to: 
Node has a Revision Update Count of 213336 but local value is only 213335. Node 
appears not to have the appropriate set of Component Revisions 2025-04-15 
13:45:44,495 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 -- Requesting that node connect 
to cluster 2025-04-15 13:45:44,506 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Status of 
nifi-0.nifi-hs.pipercx.svc.cluster.local:8443 changed from 
NodeConnectionStatus[nodeId=nifi-0.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTED, updateId=4347] to 
NodeConnectionStatus[nodeId=nifi-0.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTING, updateId=4351] 2025-04-15 13:45:44,565 WARN [Heartbeat 
Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Requesting that 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 reconnect to the cluster due to: 
Node has a Revision Update Count of 213336 but local value is only 213335. Node 
appears not to have the appropriate set of Component Revisions 2025-04-15 
13:45:44,565 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 -- Requesting that node connect 
to cluster 2025-04-15 13:45:44,571 INFO [Heartbeat Monitor Thread-1] 
o.a.n.c.c.node.NodeClusterCoordinator Status of 
nifi-2.nifi-hs.pipercx.svc.cluster.local:8443 changed from 
NodeConnectionStatus[nodeId=nifi-2.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTED, updateId=4350] to 
NodeConnectionStatus[nodeId=nifi-2.nifi-hs.pipercx.svc.cluster.local:8443, 
state=CONNECTING, updateId=4352] 2025-04-15 13:45:44,650 INFO [Heartbeat 
Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 3 
heartbeats in 155 millis{code}
{{}}

> NiFi revision manager and heartbeat validation conflicts leading to 
> reconnection request
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-13885
>                 URL: https://issues.apache.org/jira/browse/NIFI-13885
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.23.0, 1.24.0, 1.23.2
>            Reporter: Shubham Sharma
>            Priority: Critical
>         Attachments: hkl25073070_jstack.tar.gz, hkl25073177_jstack.tar.gz
>
>
> With a 5 node NiFi cluster deployment we observed a critical issue where 
> reconnection requests are getting generated frequently due to difference in 
> local coordinator's and given node's revisionUpdateCounter, affecting 
> workflow lifecycle.
> *User Log summary*
> *User request to update controller service on random node(test01)*
> {code:java}
> 2024-10-14 18:44:55,927 INFO [NiFi Web Server-151009] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 PUT 
> https://test01.dvl:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>  
> *User request forwarded immediately to coordinator(test02)*
>  
> {code:java}
>   2024-10-14 18:44:55,929 INFO [NiFi Web Server-1414436] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 PUT 
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598
>     {code}
>  
> *Delay to complete request as shown here in coordinator (test02)*
> {code:java}
> 2024-10-14 18:46:17,952 INFO [NiFi Web Server-1414436] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 GET 
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>  
> *App Log analysis*
> Added a log statement at the end of updateRevision function in 
> NaiveRevisionManager for printing updated value of revisionUpdateCounter.
>  
> *Coordinator app log summary*
> On main node test01 it took around 48 seconds to process above request for 
> updating disabled controller service via restAPI and update 
> revisionUpdateCounter to *3081375*
>  
> *2024-10-14 18:45:43,455* INFO [NiFi Web Server-151009] 
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter 
> *3081375*
>  
> *Coordinator (test02)received its own heartbeat*
>  
> *2024-10-14 18:45:47,143* INFO [Process Cluster Protocol Request-62] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> test02:9443, revisionId {*}3081374{*}, activethreadCount 79 , 
> connectionStatus CONNECTED -----
>  
>  *Received test01 node's heartbeat request to coordinator BUT with updated 
> revisionID*
> *2024-10-14 18:45:57,724* INFO [Process Cluster Protocol Request-91] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> test01:9443, revisionId {*}3081375{*}, activethreadCount 73 , 
> connectionStatus CONNECTED -----
>  
> *Scheduled Heartbeat validation gets triggered but due to above mismatch it 
> requested reconnection for test01*
> *2024-10-14 18:46:07,654* WARN [Heartbeat Monitor Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator ----- Requesting that {*}test01{*}:9443 
> reconnect to the cluster due to: Node has a Revision Update Count of 
> *3081375* but local value is only {*}3081374{*}. Node appears not to have the 
> appropriate set of Component Revisions -----
>  
> *During reconnection request on test01 node, ie. going under reconnection and 
> synchronisation process, finally coordinator received updated revision 
> counter*
> *2024-10-14 18:46:11,238* INFO [NiFi Web Server-1414436] 
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter 
> *3081375*
>  
> *And next heartbeat* *sent by coordinator (test02) to itself* *containing 
> updated revision counter* 
> *2024-10-14 18:46:17,213* INFO [Process Cluster Protocol Request-2] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> {*}test02{*}:9443, revisionId {*}3081375{*}, activethreadCount 111 , 
> connectionStatus CONNECTED -----
>  
> This reconnection attempts is happening frequently and affecting workflow 
> request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NIFI-13885) NiFi revision manager and heartbeat validation conflicts leading to reconnection request

Reply via email to