cloud-fan commented on PR #36995:
URL: https://github.com/apache/spark/pull/36995#issuecomment-1226676187

   > This means it now relies on Spark's hash function for bucketing though, 
which could be different from other engines.
   
   Let's think about it this way: the v2 data source only needs Spark to 
local-sort the data by bucket id, which means the required ordering will be a 
v2 function that computes the bucket id. The v2 writer then computes the bucket 
id again using the same v2 function during data writing. Alternatively, the v2 
writer can keep open file handles in a hash map so that Spark doesn't need to 
sort the data at all. The extra clustering only serves to reduce the number of 
files we write out.
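   To illustrate the hash-map alternative, here is a minimal sketch in plain Scala. The names (`bucketId`, `BucketedWriter`) and the buffer standing in for a file handle are assumptions for illustration, not Spark's actual DataSource V2 API; the point is only that the writer can route unsorted rows by looking up (or opening) a handle per bucket id.

   ```scala
   import scala.collection.mutable

   // Stand-in for the table's v2 bucket transform; any deterministic
   // function of the key works, it need not match Spark's own hash.
   def bucketId(key: Int, numBuckets: Int): Int =
     ((key % numBuckets) + numBuckets) % numBuckets

   // Hypothetical writer that keeps one open "handle" per bucket id,
   // so rows can arrive in any order without a preceding local sort.
   class BucketedWriter(numBuckets: Int) {
     // bucket id -> rows written to that bucket (a buffer stands in
     // for an open file handle here)
     private val open = mutable.Map.empty[Int, mutable.Buffer[Int]]

     def write(key: Int): Unit = {
       val id = bucketId(key, numBuckets)
       // Look up the handle for this bucket, opening it on first use.
       open.getOrElseUpdate(id, mutable.Buffer.empty) += key
     }

     def buckets: Map[Int, Seq[Int]] = open.view.mapValues(_.toSeq).toMap
   }
   ```

   The trade-off is memory: the sort-based path needs only one open handle at a time, while the hash-map path holds one handle per bucket seen by the task.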
   
   Spark's hash algorithm only matters when reading bucketed tables and 
trying to avoid shuffles, and I think that case is already handled well: Spark 
will re-shuffle a v2 table scan if the other side of the join is a normal 
table scan that requires a shuffle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

