So repartition() would look at some other config
(spark.sql.adaptive.advisoryPartitionSizeInBytes) to decide the partition size
to use? Does it require AQE? If so, what does a repartition() call do if AQE is
not enabled? This is essentially a new API, so would repartitionBySize or
something similar be less confusing to users who already use
repartition(num_partitions)?
Tom
On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan <[email protected]>
wrote:
Ideally this should be handled by the underlying data source to produce a
reasonably partitioned RDD as the input data. However, if we already have a
poorly partitioned RDD at hand and want to repartition it properly, I think an
extra shuffle is required so that we can know the partition size first.
That said, I think calling `.repartition()` with no args is indeed a good
solution for this problem.
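
To illustrate why the partition sizes have to be known first, here is a minimal sketch of size-based coalescing in plain Java. It is a simplified model and not Spark's actual AQE code: the class and method names (`CoalesceSketch`, `coalesce`) and the sample sizes are hypothetical, and the 64 MB target only stands in for a value such as spark.sql.adaptive.advisoryPartitionSizeInBytes. Once the per-partition sizes are available (which is what the extra shuffle provides), adjacent partitions can be packed greedily up to the target:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Spark.
class CoalesceSketch {

    /**
     * Greedily groups adjacent partition indices so that each group's total
     * size stays at or below targetBytes. A single partition larger than the
     * target ends up in a group of its own (this sketch never splits one).
     */
    static List<List<Integer>> coalesce(long[] sizes, long targetBytes) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < sizes.length; i++) {
            // Close the current group if adding partition i would exceed the target.
            if (!current.isEmpty() && currentBytes + sizes[i] > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += sizes[i];
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed sizes of five input partitions, known only after the shuffle.
        long[] sizes = {10 * mb, 20 * mb, 30 * mb, 50 * mb, 5 * mb};
        System.out.println(coalesce(sizes, 64 * mb)); // prints [[0, 1, 2], [3, 4]]
    }
}
```

Without the sizes there is nothing to pack against, which is why a fixed partition count is the only option before the shuffle has run.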
On Sat, May 22, 2021 at 1:12 AM mhawes <[email protected]> wrote:
Adding /another/ update to say that I'm currently planning on using a
recently introduced feature whereby calling `.repartition()` with no args
will cause the dataset to be optimised by AQE. This actually suits our
use-case perfectly!
Example:
sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();
assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true
Relevant PRs/Issues:
[SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
https://github.com/apache/spark/pull/27986
[SPARK-32056][SQL] Coalesce partitions for repartition by expressions when AQE is enabled
https://github.com/apache/spark/pull/28900
[SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled
https://github.com/apache/spark/pull/28952