Github user ssimeonov commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    @markhamstra the purpose of this PR is not to address dynamic resource management in arbitrarily complex Spark environments; most Spark users do not operate in such environments. It is to help users with simple setups refactor code such as
    
    ```scala
    df.repartition(25) // and related repartition() + coalesce() variants
    ```
    
    to make job execution take advantage of additional cores when they are available.
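    
    For example (a minimal sketch, not this PR's exact API: `defaultParallelism` stands in here for the core-count accessors the PR proposes, and the 3x multiplier is an illustrative choice that leaves headroom for skew):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("repartition-by-available-cores")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    
    val df = (1 to 1000000).toDF("id")
    
    // Before: a hard-coded partition count that ignores how many cores the job has.
    val fixed = df.repartition(25)
    
    // After: derive the partition count from the parallelism the job can see.
    // A small multiplier leaves headroom when partition sizes are skewed.
    val cores = spark.sparkContext.defaultParallelism
    val scaled = df.repartition(cores * 3)
    
    println(s"fixed=${fixed.rdd.getNumPartitions}, scaled=${scaled.rdd.getNumPartitions}")
    ```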
    
    Asking for a greater degree of parallelism than the cores a job has 
available rarely has significant negative effects (for reasonable values). 
Asking for a low degree of parallelism when there are lots of cores available 
has significant negative effects, especially in the common real-world use cases 
where there is lots of data skew. That's the point that both you and @mridulm 
seem to be missing. The arguments about resources flexing during job execution do not change this.
    
    My team has used this simple technique for years on both static and 
autoscaling clusters and we've seen meaningful performance improvements in both 
ETL and ML/AI-related data production for data ranging from gigabytes to 
petabytes. The idea is simple enough that even data scientists can (and do) 
easily use it. That's the benefit of this PR and that's why I like it. The cost 
of this PR is adding two simple & clear methods. The cost-benefit analysis 
seems obvious.
    
    I agree with you that lots more can be done to handle the general case of 
better matching job resource needs to cluster/pool resources. This work is 
going to take forever given the current priorities. Let's not deny the majority 
of Spark users simple & real execution benefits while we dream about amazing 
architectural improvements. 
    
    When looking at the net present value of performance, the discount rate is high: performance improvements now are worth a lot more than performance improvements in the far future.

