orhanguvenc created NIFI-15321:
----------------------------------
Summary: Add new execution mode “All Nodes Except Primary” to
prevent Primary node overload in large-scale ListSFTP deployments
Key: NIFI-15321
URL: https://issues.apache.org/jira/browse/NIFI-15321
Project: Apache NiFi
Issue Type: Improvement
Components: Core Framework
Affects Versions: 2.2.0
Environment: Apache NiFi 2.2.0 running on Kubernetes/OpenShift, 5-node
NiFi cluster. Each NiFi pod is allocated 20 CPU cores, 100 GB RAM, 32 GB JVM
heap, and configured with 80 Timer Driven threads. Approximately 720 ListSFTP
processors operate simultaneously due to 120 integrations × 6 remote SFTP hosts.
Reporter: orhanguvenc
h2. *Problem Overview*
In Kubernetes/OpenShift-based NiFi clusters, Primary-only processors such as
*ListSFTP* cause severe load concentration on the Primary node.
In our environment, we run:
* 5 NiFi pods
* Each pod: *20 CPU cores, 100 GB RAM, 32 GB heap, 80 Timer Driven threads*
* *~720 ListSFTP processors* (120 integrations × 6 remote SFTP hosts)
Because ListSFTP must run with {*}Execution = Primary Node{*}, all 720 ListSFTP
processors execute exclusively on whichever pod becomes Primary.
This saturates the Primary’s thread pool and CPU capacity, causing:
* Thread starvation on Primary
* Downstream processors (FetchSFTP, UpdateAttribute, CompressContent,
DetectDuplicate, CustomProcessor, etc.) on the Primary node to stop making
progress
* FlowFiles assigned to Primary via load-balanced connections to remain stuck
in queues
* Cluster-wide throughput collapse and processing imbalance
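To make the contention concrete, here is a rough back-of-the-envelope calculation using only the figures from the environment above. It is a simplification: it assumes every ListSFTP processor is due to run at the same moment and ignores run schedules and listing duration, so treat it as a worst-case illustration rather than a measurement.

```python
# Figures taken from the environment described in this issue.
integrations = 120
sftp_hosts = 6
timer_driven_threads_per_node = 80
nodes = 5

# 120 integrations x 6 hosts = 720 ListSFTP processors.
list_processors = integrations * sftp_hosts

# All of them are scheduled on the single Primary node, so they compete
# for that one node's Timer Driven threads.
processors_per_primary_thread = list_processors / timer_driven_threads_per_node

# Threads that exist on the other nodes but cannot be used for listing.
cluster_threads = nodes * timer_driven_threads_per_node
threads_unusable_for_listing = cluster_threads - timer_driven_threads_per_node

print(list_processors)                # 720
print(processors_per_primary_thread)  # 9.0
print(threads_unusable_for_listing)   # 320
```

In other words, roughly nine listing processors contend for each Primary thread, while 320 of the cluster's 400 Timer Driven threads cannot take any of that work.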
This is {*}not solvable by adding hardware{*}, because:
* Primary election is dynamic in Kubernetes
* Any pod can become Primary
* Every pod would therefore have to be sized for the worst-case ListSFTP load
* The bottleneck is caused by scheduling limitations, not by hardware
----
h2. *Requested Improvement*
h3. *Add a new execution mode: “All Nodes Except Primary”*
This mode would allow heavy downstream processors to run on {_}all cluster
nodes except the Primary{_}.
The Primary node would continue to run Primary-only processors (e.g., ListSFTP)
without also having to carry the high CPU/thread load of downstream
processing.
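The intended semantics can be sketched as a simple eligibility check. This is only an illustration of the requested behavior, not NiFi code: the `ExecutionMode` enum, the `may_run_on` function, and the node names are all hypothetical.

```python
from enum import Enum


class ExecutionMode(Enum):
    """Illustrative modes; the third value is the one proposed in this issue."""
    ALL_NODES = "All Nodes"
    PRIMARY_NODE = "Primary Node"
    ALL_NODES_EXCEPT_PRIMARY = "All Nodes Except Primary"  # hypothetical


def may_run_on(mode: ExecutionMode, node_is_primary: bool) -> bool:
    """Return True if a processor with this mode may be scheduled on the node."""
    if mode is ExecutionMode.PRIMARY_NODE:
        return node_is_primary
    if mode is ExecutionMode.ALL_NODES_EXCEPT_PRIMARY:
        return not node_is_primary
    return True  # ALL_NODES


# Hypothetical 5-node cluster in which nifi-2 currently holds the Primary role.
nodes = {f"nifi-{i}": (i == 2) for i in range(5)}

# Nodes eligible for a downstream processor such as FetchSFTP under the new mode.
fetch_nodes = [name for name, is_primary in nodes.items()
               if may_run_on(ExecutionMode.ALL_NODES_EXCEPT_PRIMARY, is_primary)]

# Nodes eligible for a Primary-only processor such as ListSFTP.
list_nodes = [name for name, is_primary in nodes.items()
              if may_run_on(ExecutionMode.PRIMARY_NODE, is_primary)]

print(fetch_nodes)  # ['nifi-0', 'nifi-1', 'nifi-3', 'nifi-4']
print(list_nodes)   # ['nifi-2']
```

Because the Primary role can move between pods, a check like this would need to be re-evaluated whenever the role changes.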
h3. *Benefits*
* Prevents Primary node overload
* Ensures downstream processors never run on a saturated Primary
* Eliminates queue imbalance caused by Primary starvation
* Improves cluster-wide throughput and stability
* Fully compatible with Kubernetes/OpenShift ephemeral scheduling
* Reduces the need for overprovisioning (sizing every pod at 20 CPUs / 100 GB
RAM would no longer be necessary)
----
h2. *Alternative Enhancements (Secondary options)*
# *Dedicated thread pool for Primary-only processors*
Prevents them from starving general processor scheduling.
# *Improved workload isolation for listing processors*
Allows Primary to handle coordination work without blocking dataflow execution.
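The first alternative, a dedicated thread pool, can be illustrated with a small sketch. It uses plain Python thread pools as a stand-in and says nothing about how NiFi's scheduler is implemented; the pool sizes, function names, and task counts are assumptions chosen only to keep the example small.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool sizes, chosen only to keep the example small.
listing_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="primary-only")
general_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="general")

release_listings = threading.Event()


def slow_listing(processor_id: int) -> str:
    """Stand-in for a listing run that blocks on a slow remote host."""
    release_listings.wait()
    return f"listed-{processor_id}"


def downstream(flowfile_id: int) -> str:
    """Stand-in for FetchSFTP or similar downstream work."""
    return f"processed-{flowfile_id}"


# Queue far more listing tasks than the dedicated pool has threads.
listing_futures = [listing_pool.submit(slow_listing, i) for i in range(20)]

# Downstream work still completes, because it never waits for a listing thread.
downstream_results = [general_pool.submit(downstream, i).result(timeout=5)
                      for i in range(3)]
listings_still_blocked = not any(f.done() for f in listing_futures)

# Unblock the listings and shut both pools down cleanly.
release_listings.set()
listing_results = [f.result(timeout=5) for f in listing_futures]
listing_pool.shutdown()
general_pool.shutdown()
```

With a single shared pool of the same total size, the blocked listing tasks could occupy every thread and the downstream tasks would wait behind them, which is the starvation pattern described in this issue.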
Of the options in this issue, the *minimum impactful change* is adding:
h3. *Execution = All Nodes Except Primary*
----
h2. *Conclusion*
Large ingestion architectures (hundreds of ListSFTP processors) require
improved scheduling primitives to keep clusters stable.
Introducing *“All Nodes Except Primary”* would allow NiFi to scale horizontally
and reliably in modern Kubernetes environments and prevent Primary node
overload that occurs with stateful listing processors.
We will gladly provide thread dumps, scheduling traces, and cluster performance
metrics if needed for development or validation.