Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    > Users are not expected to override it unless they want fine grained 
control over the value
    
    This is actually one of the use cases when a user needs to take control or 
tune a query. `defaultParallelism` is used in many places, for example 
https://github.com/apache/spark/blob/9549a2814951f9ba969955d78ac4bd2240f85989/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L594-L597
 . If users want to tune the behavior of those methods, they have to change 
`defaultParallelism`, and then the factor `5` in `df.repartition(5 * 
sc.defaultParallelism)` has to be re-tuned accordingly (see the sketch below). 
In this way we just force users to introduce absolutely unnecessary complexity 
and dependencies into their code. If I need the number of cores in my cluster, 
I would like a direct way to get it instead of hoping that some method returns 
this number implicitly.
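
    Here is a minimal sketch of the coupling described above (the config value 
`400` and the factor `5` are arbitrary illustration numbers, not 
recommendations):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Raising default parallelism to tune an internal code path
      // (for example the Parquet reader linked above) ...
      .config("spark.default.parallelism", "400")
      .getOrCreate()
    val sc = spark.sparkContext

    val df = spark.range(0, 1000000).toDF("id")

    // ... silently changes the meaning of every expression built on top of it.
    // The factor 5 was chosen for the old defaultParallelism and now has to be
    // re-tuned too, even though this part of the app is unrelated.
    val repartitioned = df.repartition(5 * sc.defaultParallelism)
    ```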
    
    > One thing to be kept in mind is that dynamic resource allocation will 
kick in after tasks are submitted ...
    
    Let me show you another use case which I observe in practice. Our 
customers write code in notebooks and can attach those notebooks to different 
clusters. Usually the code is developed and debugged on a small (staging) 
cluster. After that the notebooks are re-attached to a production cluster 
which may have a completely different size. Pretty often users just leave 
existing params/constants, like the ones in `repartition()`, as is. That 
usually leads to underloading or overloading the cluster. Why can't they use 
`defaultParallelism` everywhere? Look at the use case above: tuning one part 
of a user's app requires changing factors in other parts that are completely 
independent from the first one. A rough sketch of this scenario is below.
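
    The notebook scenario looks roughly like this; the input path and the 
numbers below are made up for illustration, the point is only the contrast 
between a hard-coded constant and a cluster-relative one:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input; any DataFrame works here.
    val df = spark.read.parquet("/data/events")

    // Tuned on a small staging cluster and then left as is: on a much larger
    // production cluster this underloads it, on a smaller one it overloads it.
    val hardCoded = df.repartition(64)

    // Sized relative to whatever cluster the notebook is attached to, so the
    // same code behaves sensibly on both staging and production, as long as
    // defaultParallelism really reflects the number of cores in the cluster.
    val clusterRelative = df.repartition(2 * sc.defaultParallelism)
    ```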


