kito4 commented on issue #17661: URL: https://github.com/apache/dolphinscheduler/issues/17661#issuecomment-3511343849
In production environments, workflows often reserve more resources (CPU, RAM) than they actually use. For example, several tasks may each declare 8 GB of RAM but consume only 1–2 GB on average. As a result, cluster utilization stays low even though no new workflows can be scheduled, because the declared reservations already fill the physical capacity.

To improve efficiency, we can apply controlled resource oversubscription: temporarily allocating more logical resources than are physically available, based on real usage metrics. However, DolphinScheduler currently lacks mechanisms to monitor real-time utilization or to manage oversubscription safely without risking node overload or instability.

Key Components:

- ResourceMonitorAnalyzer: processes the real-time CPU and memory data already reported by worker heartbeats and monitoring controllers.
- OversubscriptionController: calculates the oversubscription ratio and decides whether to allow or delay task dispatch.
- PolicyEngine: defines prioritization and throttling rules under oversubscription.
- MetricsReporter: exports oversubscription metrics to the existing metrics framework (Prometheus, REST API).

Workflow:

1. Each worker periodically reports its actual resource usage (usedCPU, usedMemory).
2. The OversubscriptionController calculates:

   oversubscription_ratio = allocated_resources / physical_resources
   utilization_rate = used_resources / physical_resources

   - If utilization_rate < threshold (e.g., 60%), new tasks can be accepted even if allocated resources exceed 100% of physical capacity.
   - If utilization_rate > safety limit (e.g., 90%), the controller triggers back-pressure and suspends new task dispatch.
3. Tasks can be prioritized based on workflow class (CRITICAL > NORMAL > BEST_EFFORT).

Configuration Parameters:

- maxOversubscriptionFactor = 1.5: maximum allowed ratio of allocated to physical resources.
- lowUtilizationThreshold = 60%: CPU/memory utilization below which oversubscription is considered safe.
- highUtilizationThreshold = 90%: utilization above which task submission is throttled.
- priorityMode = NORMAL: workflow scheduling priority mode.
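
To make the dispatch rule concrete, here is a minimal sketch of the decision the OversubscriptionController would make, written in Python for brevity rather than as DolphinScheduler code. The names `Decision`, `OversubscriptionConfig` and `decide_dispatch` are hypothetical. Two behaviours are assumptions the proposal does not spell out: maxOversubscriptionFactor is applied as a cap on the ratio including the incoming request, and utilization between the low and high thresholds permits no further oversubscription (the task is delayed).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"      # dispatch the task now
    DELAY = "delay"        # keep the task queued and retry later
    THROTTLE = "throttle"  # back-pressure: suspend new dispatch


@dataclass(frozen=True)
class OversubscriptionConfig:
    # Values mirror the configuration parameters listed in the proposal.
    max_oversubscription_factor: float = 1.5
    low_utilization_threshold: float = 0.60
    high_utilization_threshold: float = 0.90


def decide_dispatch(allocated, used, physical, requested,
                    config=OversubscriptionConfig()):
    """Decide whether a task declaring `requested` units may be dispatched.

    All quantities use the same unit (for example GB of memory).
    """
    if physical <= 0:
        raise ValueError("physical capacity must be positive")

    utilization_rate = used / physical
    # Assumption: the cap is checked against the ratio *after* adding the task.
    projected_ratio = (allocated + requested) / physical

    # Real usage above the safety limit: trigger back-pressure.
    if utilization_rate > config.high_utilization_threshold:
        return Decision.THROTTLE

    # The task still fits within physical capacity: no oversubscription needed.
    if projected_ratio <= 1.0:
        return Decision.ACCEPT

    # Oversubscribe only while real usage is low and the cap is respected.
    if (utilization_rate < config.low_utilization_threshold
            and projected_ratio <= config.max_oversubscription_factor):
        return Decision.ACCEPT

    # Assumption: in the band between the two thresholds, or beyond the cap,
    # the task is delayed rather than rejected.
    return Decision.DELAY
```

For example, on a 64 GB node that is fully reserved (allocated = 64) but only using 16 GB, `decide_dispatch(64, 16, 64, 8)` accepts an 8 GB task, because utilization is 25% and the projected ratio is 1.125. With 60 GB actually in use, the same call is throttled.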
