cloud-fan edited a comment on issue #26409: [SPARK-29655][SQL] Enable adaptive execution should not add more ShuffleExchange URL: https://github.com/apache/spark/pull/26409#issuecomment-552502824

Let's think about the expected behavior. This is really a cost problem, but we should settle on a simple rule, since cost estimation is not realistic in Spark for now.

1. We should not blindly avoid shuffle. Very few buckets can lead to poor performance because parallelism is low.
2. We should avoid shuffle when the number of buckets is reasonable. Imagine we join a bucketed table with a big table: we can skip re-shuffling the bucketed side and shuffle only the big table into the bucketed table's partitioning, as long as the bucket count is a reasonable number of shuffle partitions for the big table. "Reasonable" is hard to define precisely, so I think it's OK to take the value of `spark.sql.shuffle.partitions` as the reasonable number.
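The rule in point 2 could be sketched as a tiny predicate. This is a hypothetical illustration of the decision, not Spark's actual planner code; `can_avoid_shuffle` and its parameter names are made up for this sketch:

```python
def can_avoid_shuffle(num_buckets: int, default_shuffle_partitions: int) -> bool:
    """Skip re-shuffling the bucketed side of a join only when its bucket
    count provides at least as much parallelism as a normal shuffle would,
    using spark.sql.shuffle.partitions as the threshold."""
    return num_buckets >= default_shuffle_partitions

# With Spark's default spark.sql.shuffle.partitions = 200:
print(can_avoid_shuffle(200, 200))  # True: shuffle only the big table, into 200 buckets
print(can_avoid_shuffle(8, 200))    # False: 8 buckets means too little parallelism
```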
