[
https://issues.apache.org/jira/browse/NIFI-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Payne updated NIFI-10052:
------------------------------
Labels: cluster heartbeat stability (was: )
> Avoid obtaining any locks when creating/sending heartbeats
> ----------------------------------------------------------
>
> Key: NIFI-10052
> URL: https://issues.apache.org/jira/browse/NIFI-10052
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Mark Payne
> Priority: Major
> Labels: cluster, heartbeat, stability
>
> When NiFi creates a heartbeat to send to the coordinator, it currently obtains
> several locks in order to generate that heartbeat. We should avoid obtaining any
> read locks, write locks, or synchronized monitors, especially those that may
> be held for a while. Otherwise, NiFi can get disconnected from the cluster
> when a write lock is held for a long time, because no heartbeat can be created
> until the lock is released.
> Specifically, the following locks are obtained, at minimum:
> * FlowController readLock in the createHeartbeatMessage() method. Due to
> refactoring, this read lock is not necessary at all.
> * revisionManager.getRevisionUpdateCount() is synchronized. However, the
> synchronization here is not needed, as the method just returns the result of
> AtomicLong.get(). This is perhaps the most important lock to avoid because any
> update to a component or group of components happens within
> revisionManager.updateRevision, which is also synchronized. So a large
> request like deleting thousands of components will block heartbeats from
> being created until it completes.
> * FlowController.getTotalFlowFileCount - this may be the most challenging to
> eliminate. It calls ProcessGroup.getConnections() and
> ProcessGroup.getProcessGroups(), which means that it must obtain the read
> lock of the Process Group twice for every Process Group in the flow. We may
> be able to change StandardProcessGroup's connections and processGroups maps
> to ConcurrentHashMaps and introduce a getQueueSize() method on
> ProcessGroup that avoids most of this locking.
> * The createHeartbeatMessage() method also appears to reference
> FlowController's {{connectionStatus}} member variable without any locks,
> although it is not volatile and its documentation indicates that it is guarded
> by the read/write lock. That needs to be addressed to ensure that the
> connectionStatus is always accurately reported.
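
The points above can be illustrated with a minimal Java sketch. It is not NiFi's actual code: the class names SketchRevisionManager, SketchProcessGroup, and SketchFlowController are hypothetical stand-ins, the queued counts are modeled as plain AtomicLong values, and making connectionStatus volatile is only one possible way to address the last point. The sketch shows an unsynchronized getter over an AtomicLong, a getQueueSize() method over ConcurrentHashMap-backed maps, and a heartbeat snapshot that acquires no locks.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the revision manager; not NiFi's real class.
class SketchRevisionManager {
    private final AtomicLong revisionUpdateCount = new AtomicLong();

    // Updates still serialize on the monitor, as described in the issue.
    synchronized void updateRevision(Runnable task) {
        task.run();
        revisionUpdateCount.incrementAndGet();
    }

    // Not synchronized: AtomicLong.get() is already a thread-safe read,
    // so a long-running updateRevision() cannot block this call.
    long getRevisionUpdateCount() {
        return revisionUpdateCount.get();
    }
}

// Hypothetical stand-in for a process group; queue sizes are simplified
// to one AtomicLong per connection id (an assumption for illustration).
class SketchProcessGroup {
    private final Map<String, AtomicLong> connections = new ConcurrentHashMap<>();
    private final Map<String, SketchProcessGroup> processGroups = new ConcurrentHashMap<>();

    void addConnection(String id, long queuedCount) {
        connections.put(id, new AtomicLong(queuedCount));
    }

    void addProcessGroup(String id, SketchProcessGroup group) {
        processGroups.put(id, group);
    }

    // Iterating ConcurrentHashMap views needs no read lock. The result is
    // weakly consistent: it may miss or include concurrent changes.
    long getQueueSize() {
        long total = 0;
        for (AtomicLong queued : connections.values()) {
            total += queued.get();
        }
        for (SketchProcessGroup child : processGroups.values()) {
            total += child.getQueueSize();
        }
        return total;
    }
}

// Hypothetical stand-in for the controller that builds the heartbeat.
class SketchFlowController {
    enum ConnectionStatus { CONNECTED, DISCONNECTED }

    // volatile (an assumption, see lead-in) so that a reader without a
    // lock always sees the most recently written status.
    private volatile ConnectionStatus connectionStatus = ConnectionStatus.DISCONNECTED;

    private final SketchRevisionManager revisionManager = new SketchRevisionManager();
    private final SketchProcessGroup rootGroup = new SketchProcessGroup();

    SketchRevisionManager getRevisionManager() { return revisionManager; }
    SketchProcessGroup getRootGroup() { return rootGroup; }
    void setConnectionStatus(ConnectionStatus status) { connectionStatus = status; }

    // Builds a heartbeat summary without taking any lock or monitor.
    String createHeartbeatMessage() {
        return "status=" + connectionStatus
                + ", revisionUpdateCount=" + revisionManager.getRevisionUpdateCount()
                + ", totalFlowFileCount=" + rootGroup.getQueueSize();
    }
}
```

The trade-off in this sketch is that the heartbeat reports a weakly consistent snapshot: the three values are read at slightly different moments, so the count may be marginally stale. For a periodic heartbeat that is arguably acceptable, and in exchange a long-running update can no longer delay the heartbeat.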