Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    @MaxGekk The example you cite is literally one of a handful of usages that are not easily overridden - and it is prefixed with a 'HACK ALERT'! A few others are in mllib, typically for reading schema.
    
    I will reiterate the solutions currently available to users:
    * Rely on `defaultParallelism` - this gives the expected result, unless explicitly overridden by the user.
    * If you need fine-grained information about executors, use a Spark listener (it is trivial to keep a count with `onExecutorAdded`/`onExecutorRemoved`; see the first sketch after this list).
    * If you simply want a current value without writing your own listener, use the REST API to query for current executors (see the second sketch below).
    
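    For the REST option, the monitoring API exposes executors at `/api/v1/applications/[app-id]/executors`. A rough sketch, assuming the driver UI is reachable at the default `localhost:4040` (adjust host/port for your deployment):
    
    ```scala
    import scala.io.Source
    
    // Query the driver's monitoring REST API for the current executor list.
    // The returned JSON array has one entry per executor (the driver included).
    val appId = sc.applicationId  // assumes an existing SparkContext `sc`
    val url = s"http://localhost:4040/api/v1/applications/$appId/executors"
    val executorsJson = Source.fromURL(url).mkString
    ```
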
    Having said this, I will caution against this approach if you are concerned about performance. `defaultParallelism` exists to give a default when the user does not explicitly override it when creating an `RDD`, and it reflects the current number of executors.
    Particularly when dynamic resource allocation is enabled, this value is not optimal: Spark will acquire or release resources based on pending tasks.
    
    Using available cluster resources (from the cluster manager, not Spark) as a way to model parallelism would be a better approach: externalize your configs and populate them based on the resources available to the application (in your example: the difference between test/staging/production).
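    
    For example (the config key below is hypothetical - something your application would define and set differently per environment):
    
    ```scala
    // Read a per-environment parallelism hint, falling back to defaultParallelism.
    // "spark.myapp.parallelism" is a made-up key you would set per deployment.
    val parallelism = sc.getConf.getInt("spark.myapp.parallelism", sc.defaultParallelism)
    val rdd = sc.parallelize(1 to 1000000, parallelism)
    ```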

