Github user ssimeonov commented on the issue:

https://github.com/apache/spark/pull/21589

> Repartitioning based upon a snapshot of the number of cores available cluster-wide is clearly not the correct thing to do in many instances and use cases.

I wholeheartedly agree and I can't wait for the better approach(es) you proposed. In the meantime, repartitioning to a constant number of partitions, which is what people do today, is a lot worse in most instances and use cases (obviously excluding the situations where a fixed number of partitions is driven by a requirement).

In the end, your objections provide absolutely no immediate & practical alternative to an immediate & common problem that faces any Spark user whose jobs execute on clusters of varying size, a problem that meaningfully affects performance and cost.

> ... I don't appreciate being pinned ...

None of us do, @markhamstra, but that's sometimes how we help others, in this case, the broader Spark user community.

> I don't accept your assertions of what constitutes the majority and minority of Spark users or use cases or their relative importance.

My claims are based on (a) the constitution of data engineering/science teams at all non-ISV companies whose engineering structures/head counts I know well (7), (b) what multiple recruiters are telling me about hiring trends (East Coast-biased but consistently confirmed when talking to West Coast colleagues) and (c) the audiences at Spark meetups and the Spark Summit where I speak frequently. What is your non-acceptance based on?

> As a long-time maintainer of the Spark scheduler, it is also not my concern to define which Spark users are important or not, but rather to foster system internals and a public API that benefit all users.

I still do not understand how you evaluate an API. Do you mean you have a way of knowing when a public API benefits all users _without_ understanding how user personas break down by volume and/or by importance? Or, perhaps, you evaluate an API according to how well it serves the "average" user, who must be some strange cross between a Scala Spark committer, a Java data engineer and a Python/R data scientist, or the "average" Spark job, which must be a mix between batch ETL, streaming and ML/AI training? Or, just based on what you feel is right?

Your work on the Spark scheduler and its APIs is much appreciated as is your expertise in evolving these APIs over time. However, this PR is NOT about the scheduler API. It is about the public `SparkContext`/`SparkSession` APIs that are exposed to the end users of Spark.

@MaxGekk spends his days talking to end users of Spark across dozens if not hundreds of companies. I would argue he has an excellent, mostly unbiased perspective of the life and needs of people using Spark.

Do you have an excellent and mostly unbiased perspective of how Spark is used in the real world? You work on Spark internals, which means that you do not spend your days using Spark. Your users are internal Spark developers, not the end users of Spark. You work at a top-notch ISV, a highly technical organization, which is not representative of the broader Spark community.

I strongly feel that you are trying to do what's right but have you considered the possibility that @MaxGekk has a much more accurate perspective of Spark user needs, and the urgency of addressing those needs, and that the way you judge this PR is biased by your rather unique perspective and environment?

I have nothing more to say on the topic of this PR. No matter which way it goes, I thank @MaxGekk for looking out for Spark users and @mridulm + @markhamstra for trying to do the right thing, as they see it.