kito4 commented on issue #17661:
URL: 
https://github.com/apache/dolphinscheduler/issues/17661#issuecomment-3511343849

   In production environments, multiple workflows often reserve more resources 
(CPU, RAM) than they actually use.
   For example, several tasks each declare 8 GB RAM but only consume 1–2 GB on 
average.
   As a result, cluster utilization stays low, yet no new workflows can be 
scheduled because the sum of declared resources has already reached physical 
capacity.
   To improve efficiency, we can apply controlled resource oversubscription: 
temporarily allocating more logical resources than physically available, based 
on real usage metrics.
   However, DolphinScheduler currently lacks mechanisms to monitor real-time 
utilization or to manage safe oversubscription without risking node overload or 
instability.
   
   Key Components:
   ResourceMonitorAnalyzer: processes the real-time CPU and memory data already 
reported by worker heartbeats and monitoring controllers.
   OversubscriptionController: calculates the oversubscription ratio and 
decides whether to allow or delay task dispatch.
   PolicyEngine: defines prioritization and throttling rules under 
oversubscription.
   MetricsReporter: exports oversubscription metrics to the existing metrics 
framework (Prometheus, REST API).
   
   Workflow:
   1. Each worker periodically reports its actual resource usage (usedCPU, 
usedMemory).
   2. The OversubscriptionController calculates:
      oversubscription_ratio = allocated_resources / physical_resources
      utilization_rate = used_resources / physical_resources
   3. If utilization_rate is below lowUtilizationThreshold (e.g., 60%), new 
tasks can be accepted even if allocated resources exceed 100% of physical 
capacity, up to maxOversubscriptionFactor.
   4. If utilization_rate is above highUtilizationThreshold (e.g., 90%), the 
controller triggers back-pressure and suspends new task dispatch.
   5. Tasks can be prioritized based on workflow class (CRITICAL > NORMAL > 
BEST_EFFORT).
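   The dispatch rules above could be sketched roughly as follows. This is only 
an illustration in Python, not DolphinScheduler code: NodeSnapshot, decide, 
dispatch_order and the 64 GB node are hypothetical, and the handling of the 
band between the two thresholds (accept only when no oversubscription is 
needed) is an assumption, since the proposal does not specify it.

```python
from dataclasses import dataclass
from enum import IntEnum


class WorkflowClass(IntEnum):
    # Lower value is dispatched first (CRITICAL > NORMAL > BEST_EFFORT).
    CRITICAL = 0
    NORMAL = 1
    BEST_EFFORT = 2


@dataclass
class NodeSnapshot:
    # Hypothetical container; all three fields use the same unit (e.g. GB).
    physical: float   # physical capacity of the node
    allocated: float  # sum of resources declared by tasks already placed
    used: float       # actual usage taken from the worker heartbeat


def decide(node, request, low=0.60, high=0.90, max_factor=1.5):
    """Return "ACCEPT" or "DELAY" for a task that declares `request`."""
    utilization_rate = node.used / node.physical
    # Assumption: the cap is checked against the ratio after placing the task.
    ratio_after = (node.allocated + request) / node.physical

    if utilization_rate > high:
        return "DELAY"   # back-pressure: node is close to overload
    if ratio_after <= 1.0:
        return "ACCEPT"  # fits without any oversubscription
    if utilization_rate < low and ratio_after <= max_factor:
        return "ACCEPT"  # oversubscribe, real usage is low
    return "DELAY"


def dispatch_order(pending):
    """Sort (task_name, WorkflowClass) pairs, highest priority first."""
    return sorted(pending, key=lambda item: item[1])


if __name__ == "__main__":
    # Eight tasks declare 8 GB each on an assumed 64 GB node but use 2 GB each.
    node = NodeSnapshot(physical=64, allocated=64, used=16)
    print(decide(node, request=8))
```

   With these numbers utilization_rate is 16/64 = 25% and the ratio after 
placing the task is 72/64 = 1.125, so the ninth 8 GB task is accepted even 
though the node is nominally full.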
   Configuration Parameters:
   Parameter                  Description                                               Value
   maxOversubscriptionFactor  Maximum ratio of allocated to physical resources allowed  1.5
   lowUtilizationThreshold    CPU/memory usage below which oversubscription is safe     60%
   highUtilizationThreshold   Utilization above which task submission is throttled      90%
   priorityMode               Workflow scheduling priority mode                         NORMAL
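   To make the parameters concrete, here is a minimal sketch of how they might 
be grouped and sanity-checked. The class name OversubscriptionConfig, the 
snake_case field names, and the validation rules are assumptions for 
illustration; treating the listed values as defaults, and priorityMode as one 
of the three workflow classes, is likewise a guess rather than part of the 
proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OversubscriptionConfig:
    # Values from the parameter list, used here as defaults (assumption).
    max_oversubscription_factor: float = 1.5
    low_utilization_threshold: float = 0.60   # 60%
    high_utilization_threshold: float = 0.90  # 90%
    priority_mode: str = "NORMAL"

    def __post_init__(self):
        # A factor below 1.0 would forbid even full allocation.
        if self.max_oversubscription_factor < 1.0:
            raise ValueError("maxOversubscriptionFactor must be >= 1.0")
        # The low threshold must sit strictly below the high one.
        low = self.low_utilization_threshold
        high = self.high_utilization_threshold
        if not (0.0 < low < high <= 1.0):
            raise ValueError("thresholds must satisfy 0 < low < high <= 1")
        # Assumed value set, taken from the workflow classes in the proposal.
        if self.priority_mode not in ("CRITICAL", "NORMAL", "BEST_EFFORT"):
            raise ValueError("unknown priorityMode: " + self.priority_mode)


if __name__ == "__main__":
    print(OversubscriptionConfig())
```

   Rejecting inconsistent thresholds at load time keeps the controller from 
ever seeing a configuration where the "safe" band overlaps the throttling band.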
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
