Github user ssimeonov commented on the issue:

https://github.com/apache/spark/pull/21589

@mridulm your comments make an implicit assumption that is quite incorrect: that Spark users read the Spark codebase and/or are aware of Spark internals. Please consider this PR in the context of its intended audience, who (a) do not read the source code and (b) hardly look at the API docs. What they read are things like Stack Overflow, the Databricks Guide, blog posts and (quite rarely) the occasional how-to-with-Spark book. The fact that something is possible with Spark doesn't make it easy or intuitive. The value of this PR is that it makes a common use case easy and intuitive.

Let's consider the practicality of your suggestions:

> Rely on defaultParallelism - this gives the expected result, unless explicitly overridden by user.

That doesn't address the core use case, because the scope of change and effect is very different. In the targeted use cases, a user wants to explicitly control the level of parallelism relative to the cluster's current physical state, potentially for a single stage. Relying on `defaultParallelism` exposes the user to undesired side effects, since the setting can be changed by other, potentially unrelated code the user has no control over. Introducing unintended side effects, which your suggestion does, is poor design.

> If you need fine grained information about executors, use spark listener (it is trivial to keep a count with onExecutorAdded/onExecutorRemoved).

I'd suggest you reconsider your definition of "trivial". Normal Spark users, not people who work on Spark or at companies like Hortonworks whose job is to be Spark experts, have no idea what a listener is, have never hooked one up and never will. Not to mention how much fun it is to do this from, say, R.

> If you simply want a current value without own listener - use REST api to query for current executors.

This type of suggestion is a prime example of ignoring Spark user concerns. You are comparing `sc.numExecutors` with:

1. Knowing that a REST API exists that can produce this result.
2. Learning the details of the API.
3. Picking a synchronous REST client in the language they are using Spark with.
4. Initializing the REST client with the correct endpoint, which they obtain... somehow.
5. Formulating the request.
6. Parsing the response.

I don't think there is any need to say more about this suggestion.

Taking a step back, it is important to acknowledge that Spark has become a mass-market data platform product and to start designing user-facing APIs with this in mind. If the teams I know are any indication, the majority of Spark users are not experienced backend/data engineers. They are data scientists and data hackers: people who are getting into big data via Spark. The imbalance is only going to grow. The criteria by which user-focused Spark APIs are evaluated should evolve accordingly.

From an ease-of-use perspective, I'd argue the two new methods should also be exposed on `SparkSession`, as this is the typical new user "entry point". For example, the data scientists on my team never use `SparkContext`, but they do adjust stage parallelism via implicits equivalent to the ones proposed in this PR, to significant benefit in query execution performance.
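To make the scope difference behind the `defaultParallelism` point concrete, here is a minimal sketch (the app name and the `200` value are illustrative assumptions, and a fresh session is assumed): whatever set `spark.default.parallelism` wins globally, whether or not the user ever sees it.

```scala
import org.apache.spark.sql.SparkSession

// defaultParallelism is effectively a global, configuration-driven value: cluster
// defaults, a shared init script, or a library the user did not write can all set
// spark.default.parallelism, and it then applies to every stage that falls back to it.
val spark = SparkSession.builder()
  .appName("defaultParallelism-scope")
  .config("spark.default.parallelism", "200") // imagine unrelated code doing this
  .getOrCreate()

println(spark.sparkContext.defaultParallelism) // 200, regardless of current cluster size
```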
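For context on what the "trivial" listener suggestion actually involves, here is a minimal sketch assuming a Scala user with access to the `SparkContext`; the `ExecutorCountListener` and `installExecutorCounter` names are invented for illustration.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Hypothetical helper: keeps a running executor count by reacting to
// executor lifecycle events.
class ExecutorCountListener extends SparkListener {
  private val count = new AtomicInteger(0)
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = count.incrementAndGet()
  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = count.decrementAndGet()
  def numExecutors: Int = count.get()
}

// The user has to know to register the listener early (executors that came up
// before registration are never counted) and to keep a reference to it around.
def installExecutorCounter(sc: SparkContext): ExecutorCountListener = {
  val listener = new ExecutorCountListener
  sc.addSparkListener(listener)
  listener
}
```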
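And a sketch of steps 1 through 6 for the REST suggestion, assuming the driver UI is reachable at `sc.uiWebUrl` (not guaranteed in every deployment) and leaving JSON parsing to whichever client library the user manages to pick.

```scala
import scala.io.Source
import org.apache.spark.SparkContext

// Rough sketch of the "just use the REST API" path. The monitoring endpoint
// /api/v1/applications/{appId}/executors returns a JSON array of executors
// (including an entry for the driver itself).
def executorsJson(sc: SparkContext): String = {
  val appId = sc.applicationId
  val uiUrl = sc.uiWebUrl.getOrElse(
    sys.error("Spark UI is disabled; there is no REST endpoint to query"))
  val endpoint = s"$uiUrl/api/v1/applications/$appId/executors"

  // The user still has to parse this JSON, filter out the driver entry, and decide
  // how to treat dead executors, none of which is obvious from the one-line suggestion.
  Source.fromURL(endpoint).mkString
}
```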
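Finally, a hypothetical sketch of the kind of `SparkSession`-level implicit the last paragraph alludes to. The enrichment and its use of `getExecutorMemoryStatus` are assumptions for illustration, not the PR's actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object SessionImplicits {
  // Hypothetical enrichment (not the PR's code): lets notebook users ask the session
  // for the current executor count without touching SparkContext internals.
  implicit class RichSparkSession(val spark: SparkSession) extends AnyVal {
    // Approximates the executor count from the block manager's view of the cluster;
    // the "- 1" drops the driver's own block manager entry.
    def numExecutors: Int =
      math.max(spark.sparkContext.getExecutorMemoryStatus.size - 1, 1)
  }
}

// Typical data-scientist usage: size one shuffle to the cluster's current state.
// (`events` stands in for whatever DataFrame the user is working with.)
import SessionImplicits._
// events.repartition(4 * spark.numExecutors)
```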