[
https://issues.apache.org/jira/browse/NIFI-13885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925723#comment-17925723
]
Shubham Sharma edited comment on NIFI-13885 at 2/10/25 8:25 PM:
----------------------------------------------------------------
*Brief Analysis of the Issue:*
To prevent others from falling into the same rabbit hole, here’s a summary of
our findings:
Upon collecting *JFR dumps* at regular intervals, we identified a performance
bottleneck when updating a *controller service* that involved:
1.Modifying the *URL* and *connection credentials*
2.Triggering an update request for the *revision counter*
3.Reloading the *DBCP controller service*
As part of the {*}audit entry process{*}, sensitive password updates were
handled by first *deleting* and then *inserting* records into an {*}H2
database{*}. However, as the database size increased, these operations became
significantly slower.
*Performance Bottleneck & Solution:*
With the community migrating to {*}Xodus from version 1.24.x{*}, we
{*}backported the Xodus audit framework{*}, fixing related issues in the audit
DB. This resolved our update delays and prevented reconnection issues caused by
*validation heartbeats* triggered due to these “negligible” delays.
*Testing Results:*
*Before (Using H2 Database)*
•{*}Database Size:{*} Single *17GB* fragmented & un-compacted file
•{*}Response Time:{*} *6-7 seconds* delay for a password update request
*After (Using Xodus Database)*
•{*}Database Size:{*} {*}240 files{*}, each {*}8MB{*}, totalling *2GB*
•{*}Response Time:{*} *100-200 milliseconds* for the same password update
request
By switching to {*}Xodus{*}, we observed *significant performance improvements*
compared to H2, making the system far more efficient and responsive.
was (Author: shubh_init):
*Brief Analysis of the Issue:*
To prevent others from falling into the same rabbit hole, here’s a summary of
our findings:
Upon collecting *JFR dumps* at regular intervals, we identified a performance
bottleneck when updating a *controller service* that involved:
1.Modifying the *URL* and *connection credentials*
2.Triggering an update request for the *revision counter*
3.Reloading the *DBCP controller service*
As part of the {*}audit entry process{*}, sensitive password updates were
handled by first *deleting* and then *inserting* records into an {*}H2
database{*}. However, as the database size increased, these operations became
significantly slower.
*Performance Bottleneck & Solution:*
With the community migrating to {*}Xodus from version 1.24.x{*}, we
{*}backported the Xodus audit framework{*}, fixing related issues in the audit
DB. This resolved our update delays and prevented reconnection issues caused by
*validation heartbeats* triggered due to these “negligible” delays.
{*}Testing Results:{*}{*}{*}
*Before (Using H2 Database)*
•{*}Database Size:{*} Single *17GB* fragmented & un-compacted file
•{*}Response Time:{*} *6-7 seconds* delay for a password update request
*After (Using Xodus Database)*
•{*}Database Size:{*} {*}240 files{*}, each {*}8MB{*}, totalling *2GB*
•{*}Response Time:{*} *100-200 milliseconds* for the same password update
request
By switching to {*}Xodus{*}, we observed *significant performance improvements*
compared to H2, making the system far more efficient and responsive.
> NiFi revision manager and heartbeat validation conflicts leading to
> reconnection request
> ----------------------------------------------------------------------------------------
>
> Key: NIFI-13885
> URL: https://issues.apache.org/jira/browse/NIFI-13885
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.23.0, 1.24.0, 1.23.2
> Reporter: Shubham Sharma
> Priority: Critical
> Attachments: hkl25073070_jstack.tar.gz, hkl25073177_jstack.tar.gz
>
>
> With a 5 node NiFi cluster deployment we observed a critical issue where
> reconnection requests are getting generated frequently due to difference in
> local coordinator's and given node's revisionUpdateCounter, affecting
> workflow lifecycle.
> *User Log summary*
> *User request to update controller service on random node(test01)*
> {code:java}
> 2024-10-14 18:44:55,927 INFO [NiFi Web Server-151009]
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app]
> 10.10.21.9 PUT
> https://test01.dvl:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>
> *User request forwarded immediately to coordinator(test02)*
>
> {code:java}
> 2024-10-14 18:44:55,929 INFO [NiFi Web Server-1414436]
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app]
> 10.10.21.9 PUT
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598
> {code}
>
> *Delay to complete request as shown here in coordinator (test02)*
> {code:java}
> 2024-10-14 18:46:17,952 INFO [NiFi Web Server-1414436]
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app]
> 10.10.21.9 GET
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>
> *App Log analysis*
> Added a log statement at the end of updateRevision function in
> NaiveRevisionManager for printing updated value of revisionUpdateCounter.
>
> *Coordinator app log summary*
> On main node test01 it took around 48 seconds to process above request for
> updating disabled controller service via restAPI and update
> revisionUpdateCounter to *3081375*
>
> *2024-10-14 18:45:43,455* INFO [NiFi Web Server-151009]
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61,
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61,
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter
> *3081375*
>
> *Coordinator (test02)received its own heartbeat*
>
> *2024-10-14 18:45:47,143* INFO [Process Cluster Protocol Request-62]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from
> test02:9443, revisionId {*}3081374{*}, activethreadCount 79 ,
> connectionStatus CONNECTED -----
>
> *Received test01 node's heartbeat request to coordinator BUT with updated
> revisionID*
> *2024-10-14 18:45:57,724* INFO [Process Cluster Protocol Request-91]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from
> test01:9443, revisionId {*}3081375{*}, activethreadCount 73 ,
> connectionStatus CONNECTED -----
>
> *Scheduled Heartbeat validation gets triggered but due to above mismatch it
> requested reconnection for test01*
> *2024-10-14 18:46:07,654* WARN [Heartbeat Monitor Thread-1]
> o.a.n.c.c.node.NodeClusterCoordinator ----- Requesting that {*}test01{*}:9443
> reconnect to the cluster due to: Node has a Revision Update Count of
> *3081375* but local value is only {*}3081374{*}. Node appears not to have the
> appropriate set of Component Revisions -----
>
> *During reconnection request on test01 node, ie. going under reconnection and
> synchronisation process, finally coordinator received updated revision
> counter*
> *2024-10-14 18:46:11,238* INFO [NiFi Web Server-1414436]
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61,
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61,
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter
> *3081375*
>
> *And next heartbeat* *sent by coordinator (test02) to itself* *containing
> updated revision counter*
> *2024-10-14 18:46:17,213* INFO [Process Cluster Protocol Request-2]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from
> {*}test02{*}:9443, revisionId {*}3081375{*}, activethreadCount 111 ,
> connectionStatus CONNECTED -----
>
> This reconnection attempts is happening frequently and affecting workflow
> request.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)