[ https://issues.apache.org/jira/browse/NIFI-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Joseph updated NIFI-12232:
-------------------------------
    Description:
We have been observing this issue in NiFi 1.23.2 when we try to upgrade.
Since rolling upgrades are not supported in NiFi, we scale out the revision that is running and {_}run a helm upgrade{_}.
We run NiFi in cluster mode on k8s, and there is a post job that calls the Tenants and Policies APIs.
A successful run looks like this:
{code:java}
set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
set_policies() status: '200'
'read' '/flow' policy already exists. It will be updated...
set_policies() fetching policy inside -eq 200 status: '200'
set_policies() after update PUT: '200'
set_policies() Action: 'read' Resource: '/tenants' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
set_policies() status: '200'{code}
*_This job was running fine in 1.23.0, 1.22 and other previous versions._* In {*}{{1.23.2}}{*}, the job fails very frequently with the following error logs:
{code:java}
set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
set_policies() status: '200'
'read' '/flow' policy already exists. It will be updated...
set_policies() fetching policy inside -eq 200 status: '200'
set_policies() after update PUT: '400'
An error occurred getting 'read' '/flow' policy: 'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'{code}
{{_*'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'*_}}

The job is configured to run only after all the pods are up and running. Although the pods are up, we see this exception inside the pods:
{code:java}
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
    at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
    at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
    at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
    at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
    at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
    at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
    at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
    at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
    at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
    at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
    at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
    ... 4 common frames omitted
Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
    at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
    at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
    at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
    at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
    ... 10 common frames omitted{code}
Screenshots of the UI are attached as well. The issue is also observed frequently when checking with the CLI command.
{code:java}
./cli.sh nifi cluster-summary -u https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
Total node count: 0
Connected node count: 0
Clustered: true
Connected to cluster: false{code}
We tried the following workaround:
{code:java}
1. Exec into the pod that has the flow file issue and delete the flow file so that it is removed from the PVC
2. Exit from the pod
3. Delete the pod that had the problem{code}
The pod respawns and the cluster coordinator recreates the flow file from the connected nodes.

This connected all the nodes, but it does not feel like an ideal solution: we are seeing this issue quite often and cannot run this workaround every time.

!image-2023-10-16-16-12-31-027.png!

> Frequent failed to connect node to cluster because local flow controller
> partially updated. Administrator should disconnect node and review flow for
> corruption
> ---------------------------------------------------------------------------
>
>                 Key: NIFI-12232
>                 URL: https://issues.apache.org/jira/browse/NIFI-12232
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration Management
>    Affects Versions: 1.23.2
>            Reporter: John Joseph
>            Priority: Major
>         Attachments: image-2023-10-16-16-12-31-027.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
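Note on the post job: the report says the job starts once all pods are up, yet the cluster-summary output shows the node is still not connected at that point. One possible mitigation is for the job to check cluster connectivity before calling the policies API. The following is only a minimal sketch in Python, assuming the job can capture the text printed by {{cli.sh nifi cluster-summary}}. The function names and the {{expected_nodes}} parameter are hypothetical and not part of NiFi or the reporter's job.

```python
from typing import Dict


def parse_cluster_summary(output: str) -> Dict[str, str]:
    """Parse 'Key: value' lines of cluster-summary output into a dict."""
    summary: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Keep only lines that really have the 'Key: value' shape.
        if sep and value.strip():
            summary[key.strip()] = value.strip()
    return summary


def cluster_is_ready(summary: Dict[str, str], expected_nodes: int) -> bool:
    """Return True only if this node is connected and enough nodes joined.

    expected_nodes is a hypothetical parameter: the number of NiFi pods
    the deployment is supposed to run.
    """
    try:
        connected = int(summary.get("Connected node count", "0"))
    except ValueError:
        return False
    node_connected = summary.get("Connected to cluster", "").lower() == "true"
    return node_connected and connected >= expected_nodes


if __name__ == "__main__":
    # Output copied from the report above.
    reported = (
        "Total node count: 0\n"
        "Connected node count: 0\n"
        "Clustered: true\n"
        "Connected to cluster: false\n"
    )
    print(cluster_is_ready(parse_cluster_summary(reported), expected_nodes=3))
```

With the output from the report, the check returns False, so a job gated this way would wait and retry instead of sending the PUT that currently fails with 400. This does not address the underlying FlowSynchronizationException; it only keeps the job from running against a disconnected node.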