Github user ssimeonov commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    > Repartitioning based upon a snapshot of the number of cores available 
cluster-wide is clearly not the correct thing to do in many instances and use 
cases.
    
    I wholeheartedly agree, and I can't wait for the better approach(es) you 
proposed. In the meantime, repartitioning to a constant number of partitions, 
which is what people do today, is a lot worse in most instances and use cases 
(obviously excluding situations where a fixed number of partitions is driven 
by a requirement).
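    
    For anyone skimming this thread, here is a minimal sketch of the contrast, in 
Scala. It is not the API proposed in this PR; the input path, the app name, the 
constant 200, and the x2 multiplier are purely illustrative, and it relies only 
on the existing `SparkContext.defaultParallelism` value.
    
    ```scala
    // Hypothetical sketch: hard-coded partition count vs. a count derived from
    // the cluster's current parallelism.
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()
    val df = spark.read.parquet("/data/events")  // illustrative input path
    
    // Common practice today: a constant chosen when the job was written,
    // regardless of how many cores the cluster actually has at run time.
    val fixedPartitions = df.repartition(200)
    
    // Cluster-size-aware alternative using an existing API: scale the partition
    // count from defaultParallelism (roughly, the total executor cores).
    val scaledPartitions = df.repartition(spark.sparkContext.defaultParallelism * 2)
    ```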
    
    In the end, your objections provide absolutely no immediate, practical 
alternative for a common problem that faces any Spark user whose jobs execute 
on clusters of varying size, a problem that meaningfully affects performance 
and cost.
    
    > ... I don't appreciate being pinned ...
    
    None of us do, @markhamstra, but that's sometimes how we help others, in 
this case, the broader Spark user community.
    
    > I don't accept your assertions of what constitutes the majority and 
minority of Spark users or use cases or their relative importance.
    
    My claims are based on (a) the composition of data engineering/science 
teams at all non-ISV companies whose engineering structures/head counts I know 
well (7), (b) what multiple recruiters are telling me about hiring trends (East 
Coast-biased but consistently confirmed when talking to West Coast colleagues), 
and (c) the audiences at Spark meetups and the Spark Summit, where I speak 
frequently. What is your non-acceptance based on?
    
    > As a long-time maintainer of the Spark scheduler, it is also not my 
concern to define which Spark users are important or not, but rather to foster 
system internals and a public API that benefit all users.
    
    I still do not understand how you evaluate an API. Do you mean you have a 
way of knowing when a public API benefits all users _without_ understanding how 
user personas break down by volume and/or by importance? Or perhaps you 
evaluate an API according to how well it serves the "average" user, who must be 
some strange cross between a Scala Spark committer, a Java data engineer, and a 
Python/R data scientist, or the "average" Spark job, which must be a mix of 
batch ETL, streaming, and ML/AI training? Or is it just based on what you feel 
is right?
    
    Your work on the Spark scheduler and its APIs is much appreciated, as is 
your expertise in evolving these APIs over time. However, this PR is NOT about 
the scheduler API. It is about the public `SparkContext`/`SparkSession` APIs 
that are exposed to the end users of Spark. @MaxGekk spends his days talking to 
end users of Spark across dozens if not hundreds of companies. I would argue he 
has an excellent, mostly unbiased perspective on the lives and needs of people 
using Spark. Do you have an excellent and mostly unbiased perspective on how 
Spark is used in the real world? You work on Spark internals, which means that 
you do not spend your days using Spark. Your users are internal Spark 
developers, not the end users of Spark. You work at a top-notch ISV, a highly 
technical organization, which is not representative of the broader Spark 
community.
    
    I strongly feel that you are trying to do what's right, but have you 
considered the possibility that @MaxGekk has a much more accurate perspective 
on Spark users' needs and the urgency of addressing those needs, and that the 
way you judge this PR is biased by your rather unique perspective and 
environment?
    
    I have nothing more to say on the topic of this PR. No matter which way it 
goes, I thank @MaxGekk for looking out for Spark users and @mridulm + 
@markhamstra for trying to do the right thing, as they see it.

