Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589

> Users are not expected to override it unless they want fine-grained control over the value

This is actually one of the use cases where a user needs to take control or tune a query. `defaultParallelism` is used in many places, for example: https://github.com/apache/spark/blob/9549a2814951f9ba969955d78ac4bd2240f85989/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L594-L597

If users want to tune the behavior of those methods, they have to change `defaultParallelism`, and then the factor `5` in `df.repartition(5 * sc.defaultParallelism)` has to be tuned accordingly (see the sketch below). In this way we just force users to introduce absolutely unnecessary complexity and dependencies into their code. If I need the number of cores in my cluster, I would like a direct way to get it instead of hoping that some method returns this number implicitly.

> One thing to be kept in mind is that dynamic resource allocation will kick in after tasks are submitted ...

Let me show you another use case which I have observed in practice. Our customers write code in notebooks and can attach their notebooks to different clusters. Usually code is developed and debugged on a small (staging) cluster; afterwards the notebooks are re-attached to a production cluster, which may have a completely different size. Pretty often users just leave existing params/constants, like the one passed to `repartition()`, as is. This usually leads to underloading or overloading the cluster. Why can't they use `defaultParallelism` everywhere? Look at the use case above: tuning one part of a user's app requires changing factors in other parts that are absolutely independent of the first one.
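For concreteness, here is a minimal Scala sketch of the pattern under discussion, assuming an existing `SparkSession` named `spark` (as in a notebook); the factor `5` is the same illustrative constant mentioned above, not a recommended value:

```scala
// Tie the partition count to the cluster size instead of hard-coding it,
// so the same notebook adapts when re-attached to a larger or smaller cluster.
val sc = spark.sparkContext

// Example dataset; in a real notebook this would be the user's DataFrame.
val df = spark.range(0L, 10000000L)

// defaultParallelism typically reflects the total cores granted to the
// application, so the multiplier (5 here) is the only knob left to tune.
val repartitioned = df.repartition(5 * sc.defaultParallelism)

println(s"defaultParallelism = ${sc.defaultParallelism}, " +
        s"partitions = ${repartitioned.rdd.getNumPartitions}")
```

Note that the hard-coded factor `5` still has to be re-tuned whenever `defaultParallelism` itself is overridden, which is exactly the coupling the comment objects to.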