Github user ssimeonov commented on the issue:

https://github.com/apache/spark/pull/21589

@mridulm your comments make an implicit assumption that is quite incorrect: that Spark users read the Spark codebase and/or are aware of Spark internals. Please consider this PR in the context of its intended audience, who (a) do not read the source code and (b) hardly look at the API docs. What they read are things like Stack Overflow, the Databricks Guide, blog posts and (quite rarely) the occasional how-to-with-Spark book. The fact that something is possible with Spark doesn't make it easy or intuitive. The value of this PR is that it makes a common use case easy and intuitive.

Let's consider the practicality of your suggestions:

> Rely on defaultParallelism - this gives the expected result, unless explicitly overridden by user.

That doesn't address the core use case, because the scope of change and effect is very different. In the targeted use cases, a user wants to explicitly control the level of parallelism relative to the cluster's current physical state, potentially for a single stage. Relying on `defaultParallelism` exposes the user to undesired side effects, since the setting can be changed by other, potentially unrelated code the user has no control over. Introducing unintended side effects, which your suggestion does, is poor design.

> If you need fine grained information about executors, use spark listener (it is trivial to keep a count with onExecutorAdded/onExecutorRemoved).

I'd suggest you reconsider your definition of "trivial". Normal Spark users, not people who work on Spark or at companies like Hortonworks whose job is to be Spark experts, have no idea what a listener is, have never hooked one up and never will. Not to mention how much fun it is to do this from, say, R.

> If you simply want a current value without own listener - use REST api to query for current executors.

This type of suggestion is a prime example of ignoring Spark user concerns. You are comparing `sc.numExecutors` with:

1. Knowing that a REST API exists that can produce this result.
2. Learning the details of the API.
3. Picking a synchronous REST client in the language they are using Spark with.
4. Initializing the REST client with the correct endpoint, which they obtain... somehow.
5. Formulating the request.
6. Parsing the response.

I don't think there is any need to say more about this suggestion.

Taking a step back, it is important to acknowledge that Spark has become a mass-market data platform product and to start designing user-facing APIs with this in mind. If the teams I know are any indication, the majority of Spark users are not experienced backend/data engineers. They are data scientists and data hackers: people who are getting into big data via Spark. The imbalance is only going to grow. The criteria by which user-focused Spark APIs are evaluated should evolve accordingly.

From an ease-of-use perspective, I'd argue the two new methods should also be exposed on `SparkSession`, as this is the typical new user "entry point". For example, the data scientists on my team never use `SparkContext`, but they do adjust stage parallelism via implicits equivalent to the ones proposed in this PR, to significant benefit in query execution performance.
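To make the scope difference behind the `defaultParallelism` point concrete, here is a minimal sketch (the app name and the `200` value are illustrative assumptions, and a fresh session is assumed): whatever set `spark.default.parallelism` wins globally, whether or not the user ever sees it.

```scala
import org.apache.spark.sql.SparkSession

// defaultParallelism is effectively a global, configuration-driven value: cluster
// defaults, a shared init script, or a library the user did not write can all set
// spark.default.parallelism, and it then applies to every stage that falls back to it.
val spark = SparkSession.builder()
  .appName("defaultParallelism-scope")
  .config("spark.default.parallelism", "200") // imagine unrelated code doing this
  .getOrCreate()

println(spark.sparkContext.defaultParallelism) // 200, regardless of current cluster size
```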
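For context on what the "trivial" listener suggestion actually involves, here is a minimal sketch assuming a Scala user with access to the `SparkContext`; the `ExecutorCountListener` and `installExecutorCounter` names are invented for illustration.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Hypothetical helper: keeps a running executor count by reacting to
// executor lifecycle events.
class ExecutorCountListener extends SparkListener {
  private val count = new AtomicInteger(0)
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = count.incrementAndGet()
  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = count.decrementAndGet()
  def numExecutors: Int = count.get()
}

// The user has to know to register the listener early (executors that came up
// before registration are never counted) and to keep a reference to it around.
def installExecutorCounter(sc: SparkContext): ExecutorCountListener = {
  val listener = new ExecutorCountListener
  sc.addSparkListener(listener)
  listener
}
```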
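And a sketch of steps 1 through 6 for the REST suggestion, assuming the driver UI is reachable at `sc.uiWebUrl` (not guaranteed in every deployment) and leaving JSON parsing to whichever client library the user manages to pick.

```scala
import scala.io.Source
import org.apache.spark.SparkContext

// Rough sketch of the "just use the REST API" path. The monitoring endpoint
// /api/v1/applications/{appId}/executors returns a JSON array of executors
// (including an entry for the driver itself).
def executorsJson(sc: SparkContext): String = {
  val appId = sc.applicationId
  val uiUrl = sc.uiWebUrl.getOrElse(
    sys.error("Spark UI is disabled; there is no REST endpoint to query"))
  val endpoint = s"$uiUrl/api/v1/applications/$appId/executors"

  // The user still has to parse this JSON, filter out the driver entry, and decide
  // how to treat dead executors, none of which is obvious from the one-line suggestion.
  Source.fromURL(endpoint).mkString
}
```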
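Finally, a hypothetical sketch of the kind of `SparkSession`-level implicit the last paragraph alludes to. The enrichment and its use of `getExecutorMemoryStatus` are assumptions for illustration, not the PR's actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object SessionImplicits {
  // Hypothetical enrichment (not the PR's code): lets notebook users ask the session
  // for the current executor count without touching SparkContext internals.
  implicit class RichSparkSession(val spark: SparkSession) extends AnyVal {
    // Approximates the executor count from the block manager's view of the cluster;
    // the "- 1" drops the driver's own block manager entry.
    def numExecutors: Int =
      math.max(spark.sparkContext.getExecutorMemoryStatus.size - 1, 1)
  }
}

// Typical data-scientist usage: size one shuffle to the cluster's current state.
// (`events` stands in for whatever DataFrame the user is working with.)
import SessionImplicits._
// events.repartition(4 * spark.numExecutors)
```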