Github user ssimeonov commented on the issue:

    https://github.com/apache/spark/pull/21589

@markhamstra the purpose of this PR is not to address the topic of dynamic resource management in arbitrarily complex Spark environments. Most Spark users do not operate in such environments. It is to help simple Spark users refactor code such as

```scala
df.repartition(25) // and related repartition() + coalesce() variants
```

so that job execution can take advantage of additional cores when they are available.

Asking for a greater degree of parallelism than the number of cores a job has available rarely has significant negative effects (for reasonable values). Asking for a low degree of parallelism when there are lots of cores available has significant negative effects, especially in the common real-world use cases where there is lots of data skew. That's the point that both you and @mridulm seem to be missing; the arguments about resources flexing during job execution do not change it.

My team has used this simple technique for years on both static and autoscaling clusters, and we've seen meaningful performance improvements in both ETL and ML/AI-related data production for data ranging from gigabytes to petabytes. The idea is simple enough that even data scientists can (and do) easily use it. That's the benefit of this PR and that's why I like it. The cost of this PR is adding two simple & clear methods. The cost-benefit analysis seems obvious.

I agree with you that lots more can be done to handle the general case of better matching job resource needs to cluster/pool resources. That work is going to take forever given the current priorities. Let's not deny the majority of Spark users simple & real execution benefits while we dream about amazing architectural improvements. When looking at the net present value of performance, the discount factor is large: performance improvements now are worth a lot more than performance improvements in the far future.
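For context, here is a minimal sketch of the kind of refactor being discussed, using the existing `SparkContext.defaultParallelism` as a stand-in for whatever the PR's new methods expose. The helper name `repartitionByCores` and the `perCoreFactor` knob are illustrative only and are not part of this PR.

```scala
import org.apache.spark.sql.DataFrame

object AdaptiveRepartition {
  // Hypothetical helper (not part of this PR): size the partition count
  // from the cores the job can actually use, instead of hard-coding a
  // constant like 25.
  def repartitionByCores(df: DataFrame, perCoreFactor: Int = 2): DataFrame = {
    // defaultParallelism is an existing SparkContext field that reflects
    // the parallelism currently available to the application.
    val cores = df.sparkSession.sparkContext.defaultParallelism

    // Ask for a modest multiple of the available cores; over-asking is
    // usually cheap, while under-asking on a large cluster is not.
    df.repartition(cores * perCoreFactor)
  }
}
```

Calling `AdaptiveRepartition.repartitionByCores(df)` in place of `df.repartition(25)` lets the same code use whatever parallelism a static or autoscaled cluster currently offers, which is the benefit argued for above.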