ulysses-you commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-645229971
Actually, AQE `CoalesceShufflePartitions` also uses `defaultParallelism` to determine the target partition size. Here is the key code:
```scala
// In CoalesceShufflePartitions: fall back to defaultParallelism when the
// config is not set.
val minPartitionNum = conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM)
  .getOrElse(session.sparkContext.defaultParallelism)
...
// In ShufflePartitionsUtil, where it is passed in as `minNumPartitions` and
// caps the coalesce target size.
val maxTargetSize = math.max(
  math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong, 16)
val targetSize = math.min(maxTargetSize, advisoryTargetSize)
```
In other words, `COALESCE_PARTITIONS_MIN_PARTITION_NUM` provides a way to override `defaultParallelism`, and this behavior is just like the `sessionDefaultParallelism` we are discussing.
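To make that concrete, here is a minimal runnable sketch of the same formula with made-up numbers (the data size and parallelism below are assumptions for illustration, not values from this PR):

```scala
object CoalesceTargetSizeDemo extends App {
  // spark.sql.adaptive.advisoryPartitionSizeInBytes defaults to 64 MB.
  val advisoryTargetSize = 64L * 1024 * 1024
  // Assume 8 cores and no COALESCE_PARTITIONS_MIN_PARTITION_NUM override,
  // so minNumPartitions falls back to defaultParallelism.
  val minNumPartitions = 8
  // Assume 100 MB of post-shuffle data.
  val totalPostShuffleInputSize = 100L * 1024 * 1024

  // Same math as the quoted code: cap the target size so that the coalesce
  // still produces at least `minNumPartitions` partitions.
  val maxTargetSize = math.max(
    math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong, 16)
  val targetSize = math.min(maxTargetSize, advisoryTargetSize)

  // 100 MB / 8 = 12.5 MB < the 64 MB advisory size, so the target shrinks
  // to ~12.5 MB and all 8 cores stay busy.
  println(s"targetSize = $targetSize bytes")
}
```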
For file splitting, we already have the configs `maxPartitionBytes` and `openCostInBytes`. Given the correct parallelism, they work well. There is no need to add another config to control partition size; if we did, we might never use the full core resources of the Spark application.
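For reference, here is a simplified sketch of how that file-split sizing combines the two configs with parallelism (the signature and the `fileSizes` input are stand-ins for illustration, not the exact Spark source):

```scala
// Derive a split size from maxPartitionBytes, openCostInBytes and the
// available parallelism, in the spirit of FilePartition's sizing logic.
def maxSplitBytes(
    fileSizes: Seq[Long],      // lengths of the selected files
    maxPartitionBytes: Long,   // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long,     // spark.sql.files.openCostInBytes
    defaultParallelism: Int): Long = {
  // Charge every file an open cost, then spread the total across all cores,
  // bounded above by maxPartitionBytes and below by openCostInBytes.
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}
```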
BTW, the partition split algorithm is similar between `CoalesceShufflePartitions` and `FilePartition`: `advisoryTargetSize` plays the same role as `maxPartitionBytes`, and we could try adding an `openCostInBytes` to AQE `CoalesceShufflePartitions` in the future, after a performance check.
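Purely as a sketch of that future idea (hypothetical, not existing Spark code), folding an open cost into the coalesce formula might look like:

```scala
// Hypothetical: charge each shuffle block an "open cost", mirroring what
// file splitting does per file, before computing the coalesce target size.
def coalesceTargetSizeWithOpenCost(
    totalPostShuffleInputSize: Long,
    numShuffleBlocks: Int,     // assumed count of shuffle blocks to read
    minNumPartitions: Int,
    advisoryTargetSize: Long,
    openCostInBytes: Long): Long = {
  val weightedSize =
    totalPostShuffleInputSize + numShuffleBlocks.toLong * openCostInBytes
  val maxTargetSize = math.max(
    math.ceil(weightedSize / minNumPartitions.toDouble).toLong, 16)
  math.min(maxTargetSize, advisoryTargetSize)
}
```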
I still think we need a way to control parallelism per session. At the very least, we should add a config to control file parallelism.
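For example, usage could look like the following (the config key here is hypothetical, just to illustrate the proposal):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[8]").getOrCreate()
// Hypothetical session-level config controlling file-split parallelism.
spark.conf.set("spark.sql.files.minPartitionNum", "16")
```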