ulysses-you commented on PR #38176: URL: https://github.com/apache/spark/pull/38176#issuecomment-1273981118
> For example, if ADVISORY_PARTITION_SIZE_IN_BYTES is 64M, the two sides of the join are: [30M, 30M, 30M, 30M, 30M] and [100M, 100M, 100M, 100M, 100M]. CoalesceShufflePartitions will transform ShuffleQueryStageExec to CustomShuffleReaderExec(stage, ShufflePartitionSpec[130M, 130M, 130M, 130M, 130M]), and this query should be able to convert to ShuffledHashJoin (not supported now).

I'm not sure what you mean; a join has two shuffle stages, so it should also have two shuffle reads. For your example, CoalesceShufflePartitions won't coalesce any partitions, since the combined size of two adjacent partitions is bigger than 64MB. You can find the details in the test suite `ShufflePartitionsUtilSuite`. And this example is already supported: it can be converted to SHJ if you set ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD to 64MB.

> If we enable BROADCASTJOIN, I think the data that will be loaded into memory and build HashRelation should also be less than AUTO_BROADCASTJOIN_THRESHOLD, i.e. ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD <= AUTO_BROADCASTJOIN_THRESHOLD

As I mentioned, even if every partition's size is less than AUTO_BROADCASTJOIN_THRESHOLD, the final partition sizes of the shuffle reads optimized by CoalesceShufflePartitions will be close to ADVISORY_PARTITION_SIZE_IN_BYTES. That breaks your assumption.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
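To make the coalescing argument above concrete: adjacent shuffle partitions are only merged while the merged group's size, summed across all sides of the join, stays at or below the advisory size. The following is a minimal Python sketch of that greedy rule (an illustration of the idea only, not Spark's actual `ShufflePartitionsUtil` implementation; the function name and return shape are invented here). It shows why the [30M]×5 / [100M]×5 example keeps five separate 130M reads:

```python
def coalesce_partitions(side_sizes, advisory_bytes):
    """Greedily group adjacent shuffle partitions.

    side_sizes: one list of partition sizes per join side (same length each).
    A group's size is the sum over ALL sides, and a new partition is only
    added to the current group while the group stays <= advisory_bytes.
    Returns (start, end) index ranges, one per coalesced partition.
    Simplified illustration of the idea, not Spark's real implementation.
    """
    num_parts = len(side_sizes[0])
    specs = []
    start = 0
    group = 0
    for i in range(num_parts):
        # Combined size of partition i across every shuffle side.
        size_i = sum(side[i] for side in side_sizes)
        if i > start and group + size_i > advisory_bytes:
            # Close the current group and start a new one at i.
            specs.append((start, i))
            start = i
            group = 0
        group += size_i
    specs.append((start, num_parts))
    return specs

M = 1024 * 1024
# The example from the discussion: each index combines to 130M > 64M,
# so no adjacent partitions can be merged and five groups remain.
print(coalesce_partitions([[30 * M] * 5, [100 * M] * 5], 64 * M))
# -> [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```

With smaller partitions, e.g. a single side of [20M, 20M, 20M, 20M] and the same 64M advisory size, the first three partitions are merged into one 60M group, which is why coalesced reads tend to approach ADVISORY_PARTITION_SIZE_IN_BYTES.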
