[
https://issues.apache.org/jira/browse/SPARK-24591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516595#comment-16516595
]
Apache Spark commented on SPARK-24591:
--------------------------------------
User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21589
> Number of cores and executors in the cluster
> --------------------------------------------
>
> Key: SPARK-24591
> URL: https://issues.apache.org/jira/browse/SPARK-24591
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: Maxim Gekk
> Priority: Minor
>
> Need to add two new methods. The first should return the total number of CPU
> cores across all executors in the cluster. The second should give the current
> number of executors registered in the cluster.
> The main motivations for adding these methods are:
> 1. It is best practice to manage job parallelism relative to the available
> cores, e.g., df.repartition(5 * sc.coreCount). In particular, it is an
> anti-pattern to leave a bunch of cores on large clusters twiddling their
> thumbs, doing nothing. Usually users pass predefined constants to
> _repartition()_ and _coalesce()_, choosing the constant based on the
> current cluster size. If the code runs on another cluster and/or on a
> resized cluster, they need to modify the constant each time. This happens
> frequently when a job that normally runs on, say, an hour of data on a small
> cluster needs to run on a week of data on a much larger cluster.
> 2. *spark.default.parallelism* can be used to get the total number of cores in
> the cluster, but it can be redefined by the user. The info can also be obtained
> by registering a listener, but repeating that boilerplate in every job is ugly.
> We should follow the DRY principle.
> 3. Regarding executorsCount(), some jobs, e.g., local-node ML training,
> use a lot of parallelism. It is common practice to distribute such
> jobs so that there is one partition per executor.
>
> 4. In some places users collect this info, along with other settings info
> and job timings (at the app level), for analysis. E.g., ML can be used to
> determine the optimal cluster size given different objectives, e.g.,
> fastest throughput vs. lowest cost per unit of processing.
> 5. The simpler argument is that basic cluster properties should be easily
> discoverable via APIs.
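To make motivation 1 concrete, here is a minimal sketch in plain Python. The helper name is hypothetical, and the proposed coreCount/executorsCount accessors do not exist in Spark yet; the point is only that the repartition target is derived from the cluster's core count rather than hard-coded:

```python
def target_partitions(core_count: int, partitions_per_core: int = 5) -> int:
    """Size a repartition() call relative to the cluster's total core count,
    instead of using a hard-coded constant that must be edited whenever the
    job moves to a differently sized cluster."""
    return partitions_per_core * core_count

# With the proposed API this would read, e.g.:
#   df.repartition(target_partitions(sc.coreCount))
# On a 16-core cluster: target_partitions(16) -> 80 partitions.
print(target_partitions(16))
```

Because the target scales with the cluster, the same job code runs unchanged on a small cluster over an hour of data and on a much larger cluster over a week of data.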
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]