orhanguvenc created NIFI-15321:
----------------------------------

             Summary: Add new execution mode “All Nodes Except Primary” to 
prevent Primary node overload in large-scale ListSFTP deployments
                 Key: NIFI-15321
                 URL: https://issues.apache.org/jira/browse/NIFI-15321
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
    Affects Versions: 2.2.0
         Environment: Apache NiFi 2.2.0 running on Kubernetes/OpenShift, 5-node 
NiFi cluster. Each NiFi pod is allocated 20 CPU cores, 100 GB RAM, 32 GB JVM 
heap, and configured with 80 Timer Driven threads. Approximately 720 ListSFTP 
processors operate simultaneously due to 120 integrations × 6 remote SFTP hosts.
            Reporter: orhanguvenc


h2. *Problem Overview*

In Kubernetes/OpenShift-based NiFi clusters, Primary-only processors such as 
*ListSFTP* cause severe load concentration on the Primary node.
In our environment, we run:
 * 5 NiFi pods

 * Each pod: *20 CPU cores, 100 GB RAM, 32 GB heap, 80 Timer Driven threads*

 * *~720 ListSFTP processors* (120 integrations × 6 remote SFTP hosts)

Because ListSFTP must run with {*}Execution = Primary Node{*}, all 720 ListSFTP 
processors execute exclusively on whichever pod becomes Primary.
This saturates the Primary’s thread pool and CPU capacity, causing:
 * Thread starvation on Primary

 * Downstream processors (FetchSFTP, UpdateAttribute, CompressContent, 
DetectDuplicate, custom processors, etc.) on the Primary node to stop making 
progress

 * FlowFiles assigned to Primary via load-balanced connections to remain stuck 
in queues

 * Cluster-wide throughput collapse and processing imbalance

This is {*}not solvable by adding hardware{*}, because:
 * Primary election is dynamic in Kubernetes

 * Any pod can become Primary

 * Every pod must be sized for the worst-case ListSFTP load

 * The bottleneck stems from scheduling limitations, not hardware capacity

----
h2. *Requested Improvement*
h3. *Add a new execution mode: “All Nodes Except Primary”*

This mode would allow heavy downstream processors to run on {_}all cluster 
nodes except the Primary{_}.
Primary would continue running Primary-only processors (e.g., ListSFTP), 
without being required to handle the high CPU/thread load of downstream 
processing.
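Conceptually, the proposal is a third value alongside the two execution modes NiFi already exposes ({{ALL}} and {{PRIMARY}} in {{org.apache.nifi.scheduling.ExecutionNode}}). The following is a minimal, self-contained sketch of the intended scheduling semantics only; the {{ALL_EXCEPT_PRIMARY}} value and the {{shouldRunOn}} method are hypothetical illustrations, not existing NiFi API:

```java
// Illustrative sketch of the proposed scheduling gate. NiFi's real enum,
// org.apache.nifi.scheduling.ExecutionNode, currently has only ALL and PRIMARY;
// ALL_EXCEPT_PRIMARY and shouldRunOn are hypothetical additions for this proposal.
enum ExecutionNode {
    ALL,                 // existing: run on every node
    PRIMARY,             // existing: run only on the elected Primary node
    ALL_EXCEPT_PRIMARY;  // proposed: run on every node except the Primary

    // Decides whether a processor with this execution mode may be
    // scheduled on the local node, given the node's current Primary status.
    boolean shouldRunOn(boolean nodeIsPrimary) {
        switch (this) {
            case PRIMARY:            return nodeIsPrimary;
            case ALL_EXCEPT_PRIMARY: return !nodeIsPrimary;
            default:                 return true; // ALL
        }
    }
}

public class ExecutionNodeSketch {
    public static void main(String[] args) {
        // Primary node: keeps running Primary-only listing work,
        // but skips processors marked ALL_EXCEPT_PRIMARY.
        System.out.println(ExecutionNode.PRIMARY.shouldRunOn(true));             // true
        System.out.println(ExecutionNode.ALL_EXCEPT_PRIMARY.shouldRunOn(true));  // false
        // Non-primary nodes pick up the heavy downstream processing instead.
        System.out.println(ExecutionNode.ALL_EXCEPT_PRIMARY.shouldRunOn(false)); // true
    }
}
```

Because Primary election is dynamic, the gate would have to be re-evaluated whenever the cluster coordinator announces a new Primary, so processors start or stop on a node as its role changes.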
h3. *Benefits*
 * Prevents Primary node overload

 * Ensures downstream processors never run on a saturated Primary

 * Eliminates queue imbalance caused by Primary starvation

 * Improves cluster-wide throughput and stability

 * Fully compatible with Kubernetes/OpenShift ephemeral scheduling

 * Reduces the need for overprovisioning (20 CPUs / 100 GB RAM per pod becomes 
unnecessary)

----
h2. *Alternative Enhancements (Secondary options)*
 # *Dedicated thread pool for Primary-only processors*
Prevents them from starving general processor scheduling.

 # *Improved workload isolation for listing processors*
Allows Primary to handle coordination work without blocking dataflow execution.
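Option 1 could look roughly like the sketch below: two separate schedulers so that Primary-only listing tasks never compete with general dataflow work for the same Timer Driven threads. All class names, pool sizes, and methods here are hypothetical illustrations (the 80-thread general pool mirrors the configuration reported above), not NiFi internals:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of option 1: a dedicated pool for Primary-only
// processors so they cannot starve the general Timer Driven pool.
public class IsolatedSchedulers {
    // Small dedicated pool for Primary-only (listing) processors; size is illustrative.
    private final ScheduledExecutorService primaryOnlyPool = Executors.newScheduledThreadPool(4);
    // General pool for all other processors (80 mirrors the reported Timer Driven config).
    private final ScheduledExecutorService generalPool = Executors.newScheduledThreadPool(80);

    // Routes a recurring processor task to the appropriate pool.
    void schedule(Runnable task, boolean primaryOnly, long periodSeconds) {
        ScheduledExecutorService pool = primaryOnly ? primaryOnlyPool : generalPool;
        pool.scheduleAtFixedRate(task, 0, periodSeconds, TimeUnit.SECONDS);
    }

    // Stops both pools (e.g., on node shutdown or loss of Primary role).
    void shutdown() {
        primaryOnlyPool.shutdownNow();
        generalPool.shutdownNow();
    }
}
```

With this isolation, even 720 saturated listing tasks could only exhaust their own small pool; FetchSFTP and other downstream processors on the Primary would keep receiving threads from the general pool.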

However, the *smallest change with the greatest impact* is adding:
h3. *Execution = All Nodes Except Primary*
----
h2. *Conclusion*

Large ingestion architectures (hundreds of ListSFTP processors) require 
improved scheduling primitives to keep clusters stable.
Introducing *“All Nodes Except Primary”* would allow NiFi to scale horizontally 
and reliably in modern Kubernetes environments and prevent Primary node 
overload that occurs with stateful listing processors.

We will gladly provide thread dumps, scheduling traces, and cluster performance 
metrics if needed for development or validation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)