[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable
maropu commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-657166984

Thanks for your interest, @c21

> (1).Is there a reason why we don't cover ShuffledHashJoin as well? (we are seeing in production, people also use ShuffledHashJoin a lot for joining bucketed tables when one side is small)

As you noted in (3), I think that's because of a concern that coalescing might hurt parallelism. You can see the related discussion in the history: https://github.com/apache/spark/pull/28123#discussion_r427073319

As for (1) and (3), IMO it's worth digging into them for further improvements.

> (2).Per this PR, the ordering property of coalesced bucket files does not preserve (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L317), and the ordering can be preserved through a sort-merge-way read of all sorted buckets file. This can help when reading multiple partitions of bucketed table.

I think that's a long-standing issue we have. Have you checked the discussion in SPARK-24528? If you're interested in the issue, you can revisit it there.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
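As a rough illustration of point (2) in the comment above, and not Spark's actual implementation: when several individually sorted bucket files are read into one partition, simply concatenating them loses the overall ordering, while a k-way ("sort-merge-way") read keeps it. The function names and sample key values below are hypothetical and exist only for this sketch.

```python
import heapq


def coalesce_concat(buckets):
    """Read the bucket files one after another.

    Each bucket is sorted on its own, but the combined output is not.
    """
    return [row for bucket in buckets for row in bucket]


def coalesce_sorted(buckets):
    """Read the bucket files with a k-way merge, so the output stays sorted."""
    return list(heapq.merge(*buckets))


# Hypothetical join-key values from two sorted bucket files that end up
# in the same partition after coalescing.
bucket_a = [1, 5, 9]
bucket_b = [3, 7, 11]

concatenated = coalesce_concat([bucket_a, bucket_b])
merged = coalesce_sorted([bucket_a, bucket_b])

print(concatenated)
print(merged)
```

In this sketch the merged read would let a downstream sort-merge join skip its own sort, at the cost of keeping one open reader per bucket file.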
[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable
maropu commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-646894724 Thanks! I appreciate your hard work, @imback82 ! Merged to master. Also, thanks for the reviews, @cloud-fan @viirya !
[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable
maropu commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-643500443 retest this please
[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable
maropu commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-643251602 retest this please
[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable
maropu commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-643107248 retest this please