Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    I don't accept your assertions of what constitutes the majority and minority 
of Spark users or use cases or their relative importance. As a long-time 
maintainer of the Spark scheduler, it is also not my concern to define which 
Spark users are important or not, but rather to foster system internals and a 
public API that benefit all users.
    
    I have already pointed out with some specificity how exposing the 
scheduler's low-level accounting of the number of cores or executors that are 
available at some point can encourage anti-patterns and sub-optimal Job 
execution. Repartitioning based upon a snapshot of the number of cores 
available cluster-wide is clearly not the correct thing to do in many instances 
and use cases. Beyond concern for users, as a developer of Spark internals, I 
don't appreciate being pinned to particular implementation details by having 
them directly exposed to users.
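    
    To make the anti-pattern concrete, here is a minimal sketch (the names and 
config values below are illustrative, not taken from this PR) of sizing a 
repartition from a one-time snapshot of executor and core counts. Under dynamic 
allocation that snapshot can be stale, or still ramping up, by the time the 
shuffle actually runs:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object SnapshotRepartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("snapshot-repartition").getOrCreate()
    
        // Point-in-time view of the scheduler's accounting, taken on the driver.
        // With dynamic allocation this can change before the shuffle is scheduled.
        val executorsNow = spark.sparkContext.statusTracker.getExecutorInfos.length
        val coresPerExecutor = spark.conf.get("spark.executor.cores", "1").toInt
    
        // Partition count pinned to that snapshot -- the anti-pattern in question.
        val partitions = math.max(1, executorsNow * coresPerExecutor)
    
        spark.range(0L, 1000000L).repartition(partitions).count()
    
        spark.stop()
      }
    }
    ```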
    
    And I'll repeat, this JIRA and PR look to be defining the problem to fit a 
preconception of the solution. Even for the particular users and use cases 
targeted by this PR, I wouldn't expect that those users would embrace "I can't 
repartition based upon the scheduler's notion of the number of cores in the 
cluster at some point" as a more accurate statement of their problem than "My 
Spark Jobs don't use all of the CPU resources that I am entitled to use." Even 
if we were to stipulate that a `repartition` call is inherently the only or 
best place to try to address that real user problem (and I am far from 
convinced that this is the only or best approach), I'd be far happier with 
extending the `repartition` API to include declarative goals than with 
exposing to users only part of what Spark's internals need to figure out the 
best repartitioning -- perhaps something along the lines of 
`repartition(MaximizeCPUs)` or other appropriate policy/goal enumerations.
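    
    For illustration only, a declarative shape might look something like the 
sketch below. This is hypothetical: Spark has no `RepartitionGoal` or 
`repartition(goal)` API, and the placeholder decisions inside would really 
belong to the planner/scheduler at execution time rather than to user code:
    
    ```scala
    import org.apache.spark.sql.Dataset
    
    // Hypothetical goal/policy enumeration; none of this exists in Spark today.
    sealed trait RepartitionGoal
    case object MaximizeCPUs extends RepartitionGoal    // use all cores the app is entitled to
    case object MinimizeShuffle extends RepartitionGoal // keep data movement to a minimum
    
    object GoalBasedRepartition {
      implicit class GoalOps[T](ds: Dataset[T]) {
        def repartition(goal: RepartitionGoal): Dataset[T] = goal match {
          case MaximizeCPUs =>
            // Stand-in decision for the sketch; a real implementation would defer the
            // choice to the scheduler when the stage runs instead of deciding it here.
            ds.repartition(ds.sparkSession.sparkContext.defaultParallelism)
          case MinimizeShuffle =>
            // Stand-in: avoid introducing an extra shuffle at all.
            ds.coalesce(ds.rdd.getNumPartitions)
        }
      }
    }
    ```
    
    With something along those lines, a user writes `df.repartition(MaximizeCPUs)` 
(after importing `GoalBasedRepartition._`) and never touches the scheduler's 
core or executor counts directly.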
    
    And Spark packages are not irrelevant here. In fact, a large part of their 
motivation was to handle extensions that are not appropriate for all users or 
to prove out ideas and APIs that are not yet clearly appropriate for inclusion 
in Spark itself.     

