ulysses-you commented on PR #38176: URL: https://github.com/apache/spark/pull/38176#issuecomment-1273981118
> For example, if ADVISORY_PARTITION_SIZE_IN_BYTES is 64M, the two sides of the join are: [30M, 30M, 30M, 30M, 30M] and [100M, 100M, 100M, 100M, 100M]. CoalesceShufflePartitions will transform ShuffleQueryStageExec to CustomShuffleReaderExec(stage, ShufflePartitionSpec[130M, 130M, 130M, 130M, 130M]), and this query should be able to convert to ShuffledHashJoin (not supported now).

I'm not sure what you mean; a join has two shuffle stages, so it should also have two shuffle reads. For your example, CoalesceShufflePartitions won't coalesce any partitions, since the combined size of two adjacent partitions is bigger than 64MB. You can find the details in the test suite `ShufflePartitionsUtilSuite`. And this example is already supported: it can be converted to SHJ if you set ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD to 64MB.

> If we enable BROADCASTJOIN, I think the data that will be loaded into memory and build HashRelation should also be less than AUTO_BROADCASTJOIN_THRESHOLD, i.e. ADAPTIVE_MAX_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD <= AUTO_BROADCASTJOIN_THRESHOLD

As I mentioned, even if every partition's size is less than AUTO_BROADCASTJOIN_THRESHOLD, the final partition sizes of the shuffle reads optimized by CoalesceShufflePartitions will be close to ADVISORY_PARTITION_SIZE_IN_BYTES. That breaks your assumption.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
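To make the coalescing argument above concrete: adjacent shuffle partitions are only merged while the merged group's size, summed across all sides of the join, stays at or below the advisory size. The following is a minimal Python sketch of that greedy rule (an illustration of the idea only, not Spark's actual `ShufflePartitionsUtil` implementation; the function name and return shape are invented here). It shows why the [30M]×5 / [100M]×5 example keeps five separate 130M reads:

```python
def coalesce_partitions(side_sizes, advisory_bytes):
    """Greedily group adjacent shuffle partitions.

    side_sizes: one list of partition sizes per join side (same length each).
    A group's size is the sum over ALL sides, and a new partition is only
    added to the current group while the group stays <= advisory_bytes.
    Returns (start, end) index ranges, one per coalesced partition.
    Simplified illustration of the idea, not Spark's real implementation.
    """
    num_parts = len(side_sizes[0])
    specs = []
    start = 0
    group = 0
    for i in range(num_parts):
        # Combined size of partition i across every shuffle side.
        size_i = sum(side[i] for side in side_sizes)
        if i > start and group + size_i > advisory_bytes:
            # Close the current group and start a new one at i.
            specs.append((start, i))
            start = i
            group = 0
        group += size_i
    specs.append((start, num_parts))
    return specs

M = 1024 * 1024
# The example from the discussion: each index combines to 130M > 64M,
# so no adjacent partitions can be merged and five groups remain.
print(coalesce_partitions([[30 * M] * 5, [100 * M] * 5], 64 * M))
# -> [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```

With smaller partitions, e.g. a single side of [20M, 20M, 20M, 20M] and the same 64M advisory size, the first three partitions are merged into one 60M group, which is why coalesced reads tend to approach ADVISORY_PARTITION_SIZE_IN_BYTES.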
