ulysses-you commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-645229971
Actually, AQE `CoalesceShufflePartitions` also uses `defaultParallelism` to determine the target partition size. Here is the key code:
```scala
// In CoalesceShufflePartitions: fall back to defaultParallelism when the
// config is not set.
val minPartitionNum = conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM)
  .getOrElse(session.sparkContext.defaultParallelism)
...
// In ShufflePartitionsUtil, where it is passed in as `minNumPartitions` and
// caps the coalesce target size.
val maxTargetSize = math.max(
  math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong, 16)
val targetSize = math.min(maxTargetSize, advisoryTargetSize)
```
In other words, `COALESCE_PARTITIONS_MIN_PARTITION_NUM` provides a way to override `defaultParallelism`, and this behavior is just like the `sessionDefaultParallelism` we are discussing.
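To make that concrete, here is a minimal runnable sketch of the same formula with made-up numbers (the data size and parallelism below are assumptions for illustration, not values from this PR):

```scala
object CoalesceTargetSizeDemo extends App {
  // spark.sql.adaptive.advisoryPartitionSizeInBytes defaults to 64 MB.
  val advisoryTargetSize = 64L * 1024 * 1024
  // Assume 8 cores and no COALESCE_PARTITIONS_MIN_PARTITION_NUM override,
  // so minNumPartitions falls back to defaultParallelism.
  val minNumPartitions = 8
  // Assume 100 MB of post-shuffle data.
  val totalPostShuffleInputSize = 100L * 1024 * 1024

  // Same math as the quoted code: cap the target size so that the coalesce
  // still produces at least `minNumPartitions` partitions.
  val maxTargetSize = math.max(
    math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong, 16)
  val targetSize = math.min(maxTargetSize, advisoryTargetSize)

  // 100 MB / 8 = 12.5 MB < the 64 MB advisory size, so the target shrinks
  // to ~12.5 MB and all 8 cores stay busy.
  println(s"targetSize = $targetSize bytes")
}
```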
For file splitting, we already have the configs `maxPartitionBytes` and `openCostInBytes`. Given the correct parallelism, they work well. There is no need to add another config to control partition size; if we did, we might never use the full core resources of the Spark application.
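For reference, here is a simplified sketch of how that file-split sizing combines the two configs with parallelism (the signature and the `fileSizes` input are stand-ins for illustration, not the exact Spark source):

```scala
// Derive a split size from maxPartitionBytes, openCostInBytes and the
// available parallelism, in the spirit of FilePartition's sizing logic.
def maxSplitBytes(
    fileSizes: Seq[Long],      // lengths of the selected files
    maxPartitionBytes: Long,   // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long,     // spark.sql.files.openCostInBytes
    defaultParallelism: Int): Long = {
  // Charge every file an open cost, then spread the total across all cores,
  // bounded above by maxPartitionBytes and below by openCostInBytes.
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}
```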
BTW, the partition split algorithm is similar between `CoalesceShufflePartitions` and `FilePartition`: `advisoryTargetSize` plays the same role as `maxPartitionBytes`, and we could try adding an `openCostInBytes` to AQE `CoalesceShufflePartitions` in the future, after a performance check.
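Purely as a sketch of that future idea (hypothetical, not existing Spark code), folding an open cost into the coalesce formula might look like:

```scala
// Hypothetical: charge each shuffle block an "open cost", mirroring what
// file splitting does per file, before computing the coalesce target size.
def coalesceTargetSizeWithOpenCost(
    totalPostShuffleInputSize: Long,
    numShuffleBlocks: Int,     // assumed count of shuffle blocks to read
    minNumPartitions: Int,
    advisoryTargetSize: Long,
    openCostInBytes: Long): Long = {
  val weightedSize =
    totalPostShuffleInputSize + numShuffleBlocks.toLong * openCostInBytes
  val maxTargetSize = math.max(
    math.ceil(weightedSize / minNumPartitions.toDouble).toLong, 16)
  math.min(maxTargetSize, advisoryTargetSize)
}
```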
I still think we need a way to control parallelism per session. At the very least, we should add a config to control file parallelism.
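For example, usage could look like the following (the config key here is hypothetical, just to illustrate the proposal):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[8]").getOrCreate()
// Hypothetical session-level config controlling file-split parallelism.
spark.conf.set("spark.sql.files.minPartitionNum", "16")
```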