c21 edited a comment on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-657143464


   Thanks @imback82 for making this change!
   
   Sorry for the late comment; just a few questions:
   
   (1). Is there a reason why we don't cover ShuffledHashJoin as well? (In production we see that people also use ShuffledHashJoin a lot for joining bucketed tables when one side is small.)
   
   (2). As implemented in this PR, the ordering property of coalesced bucket files is not preserved (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L317). The ordering could be preserved by a sort-merge-style read of all the sorted bucket files, which would help when reading multiple partitions of a bucketed table.
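   The sort-merge-style read mentioned above can be sketched in plain Scala as a k-way merge of already-sorted iterators via a min-heap. This is only a minimal illustration of the idea, not Spark's actual file reader; `mergeSorted` is a hypothetical helper name:

   ```scala
   import scala.collection.mutable

   // Merge several already-sorted iterators into one sorted iterator,
   // instead of concatenating them (which loses the sort order).
   def mergeSorted[T](iters: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
     // Min-heap keyed on each iterator's current head element.
     val heap = mutable.PriorityQueue.empty[(T, Iterator[T])](
       Ordering.by[(T, Iterator[T]), T](_._1).reverse)
     iters.foreach { it => if (it.hasNext) heap.enqueue((it.next(), it)) }
     new Iterator[T] {
       def hasNext: Boolean = heap.nonEmpty
       def next(): T = {
         val (head, it) = heap.dequeue()
         if (it.hasNext) heap.enqueue((it.next(), it)) // refill from same source
         head
       }
     }
   }
   ```

   Each coalesced task would wrap its sorted bucket-file iterators this way, so the combined partition stays sorted and downstream sorts can still be elided.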
   
   (3). In production we also see that coalescing can hurt parallelism when the number of buckets is too small. Another way to avoid the shuffle and sort is to split/divide the side with fewer buckets. E.g., when joining t1 (8 buckets) with t2 (32 buckets), we can keep the number of tasks at 32, and each task reading t1 applies a runtime filter to keep only its portion of the table (dividing the table with fewer buckets). The downside is that t1 is read more than once by multiple tasks, but if t1 is not big, this is a good trade-off for more parallelism (and may be better than shuffling t1 directly).
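   The splitting idea can be sketched with plain Scala (hypothetical helper names; this assumes Spark-style bucketing where a row's bucket id is `pmod(hash(key), numBuckets)`, and that the smaller bucket count divides the larger one, e.g. 8 and 32):

   ```scala
   // Positive modulo, matching Spark's pmod semantics for negative hashes.
   def pmod(x: Int, m: Int): Int = ((x % m) + m) % m

   def bucketId(keyHash: Int, numBuckets: Int): Int = pmod(keyHash, numBuckets)

   // Runtime filter applied by task `taskId` (0 until 32) when scanning t1:
   // keep only the rows whose 32-bucket id matches this task, i.e. the rows
   // that can possibly join with this task's t2 bucket.
   def keepForTask(keyHash: Int, taskId: Int, bigBuckets: Int = 32): Boolean =
     bucketId(keyHash, bigBuckets) == taskId

   // Which of t1's 8 bucket files task `taskId` needs to read: any key with
   // 32-bucket id = taskId has 8-bucket id = taskId % 8, since 8 divides 32,
   // so each task only re-reads one of t1's bucket files, not the whole table.
   def smallBucketToRead(taskId: Int, smallBuckets: Int = 8): Int =
     taskId % smallBuckets
   ```

   So t1's bucket files are each read by 32/8 = 4 tasks, but no shuffle or sort of either side is needed.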
   
   We have been running the above 3 features for years at Facebook (https://databricks.com/session_eu19/spark-sql-bucketing-at-facebook), and I would like to make, or help with, the follow-up changes if this sounds reasonable to everyone. cc @imback82, @cloud-fan, @maropu, @viirya, @gatorsmile and @sameeragarwal.




