Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    > Users are not expected to override it unless they want fine grained 
control over the value
    
    This is actually one of the use cases when a user needs to take control or 
tune a query. `defaultParallelism` is used in many places, for example 
https://github.com/apache/spark/blob/9549a2814951f9ba969955d78ac4bd2240f85989/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L594-L597
 . If users want to tune the behavior of those methods, they have to change 
`defaultParallelism`, and then the factor `5` in `df.repartition(5 * 
sc.defaultParallelism)` has to be re-tuned accordingly (see the sketch below). 
In this way we just force users to introduce absolutely unnecessary complexity 
and dependencies into their code. If I need the number of cores in my cluster, 
I would like a direct way to get it instead of hoping that some method returns 
this number implicitly.
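
    Here is a minimal sketch of the coupling described above (the config value 
`400` and the factor `5` are arbitrary illustration numbers, not 
recommendations):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Raising default parallelism to tune an internal code path
      // (for example the Parquet reader linked above) ...
      .config("spark.default.parallelism", "400")
      .getOrCreate()
    val sc = spark.sparkContext

    val df = spark.range(0, 1000000).toDF("id")

    // ... silently changes the meaning of every expression built on top of it.
    // The factor 5 was chosen for the old defaultParallelism and now has to be
    // re-tuned too, even though this part of the app is unrelated.
    val repartitioned = df.repartition(5 * sc.defaultParallelism)
    ```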
    
    > One thing to be kept in mind is that dynamic resource allocation will 
kick in after tasks are submitted ...
    
    Let me show you another use case which I observe in practice. Our 
customers write code in notebooks and can attach those notebooks to different 
clusters. Usually the code is developed and debugged on a small (staging) 
cluster. After that the notebooks are re-attached to a production cluster 
which may have a completely different size. Pretty often users just leave 
existing params/constants, like the ones in `repartition()`, as is. That 
usually leads to underloading or overloading the cluster. Why can't they use 
`defaultParallelism` everywhere? Look at the use case above: tuning one part 
of a user's app requires changing factors in other parts that are completely 
independent from the first one. A rough sketch of this scenario is below.
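
    The notebook scenario looks roughly like this; the input path and the 
numbers below are made up for illustration, the point is only the contrast 
between a hard-coded constant and a cluster-relative one:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input; any DataFrame works here.
    val df = spark.read.parquet("/data/events")

    // Tuned on a small staging cluster and then left as is: on a much larger
    // production cluster this underloads it, on a smaller one it overloads it.
    val hardCoded = df.repartition(64)

    // Sized relative to whatever cluster the notebook is attached to, so the
    // same code behaves sensibly on both staging and production, as long as
    // defaultParallelism really reflects the number of cores in the cluster.
    val clusterRelative = df.repartition(2 * sc.defaultParallelism)
    ```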


