[
https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584458#comment-17584458
]
Ziqi Liu commented on SPARK-40211:
----------------------------------
I'm actively working on this
> Allow executeTake() / collectLimit's number of starting partitions to be
> customized
> -----------------------------------------------------------------------------------
>
> Key: SPARK-40211
> URL: https://issues.apache.org/jira/browse/SPARK-40211
> Project: Spark
> Issue Type: Story
> Components: Spark Core, SQL
> Affects Versions: 3.4.0
> Reporter: Ziqi Liu
> Priority: Major
>
> Today, Spark’s executeTake() code allow for the limitScaleUpFactor to be
> customized but does not allow for the initial number of partitions to be
> customized: it’s currently hardcoded to {{{}1{}}}.
> We should add a configuration so that the initial partition count can be
> customized. By setting this new configuration to a high value we could
> effectively mitigate the “run multiple jobs” overhead in {{take}} behavior.
> We could also set it to higher-than-1-but-still-small values (like, say,
> {{{}10{}}}) to achieve a middle-ground trade-off.
>
> Essentially, we need to make {{numPartsToTry = 1L}}
> ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481])
> customizable. We should do this via a new SQL conf, similar to the
> {{limitScaleUpFactor}} conf.
>
> Spark has several near-duplicate versions of this code ([see code
> search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
> * SparkPlan
> * RDD
> * pyspark rdd
> Also, in pyspark {{limitScaleUpFactor}} is not supported either. So for
> now, I will focus on scala side first, leaving python side untouched and
> meanwhile sync with pyspark members. Depending on the progress we can do them
> all in one PR or make scala side change first and leave pyspark change as a
> follow-up.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]