cloud-fan commented on PR #36995: URL: https://github.com/apache/spark/pull/36995#issuecomment-1226676187
> This means it now relies on Spark's hash function for bucketing though, which could be different from other engines.

Let's think about it this way: the v2 data source only needs Spark to locally sort the data by bucket id, which means the required ordering will be a v2 function that generates the bucket id. The v2 writer then generates the bucket id again using the same v2 function during data writing. Alternatively, the v2 writer can keep a hash map of open file handles so that Spark doesn't need to sort the data at all. The extra clustering is only there to reduce the number of files we write out.

The Spark hash algorithm only matters when reading bucketed tables and trying to avoid shuffles, and I think that case is handled well already: Spark will shuffle a v2 table scan again if the other side of the join is a normal table scan with a shuffle.
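A minimal sketch of the second option (the hash-map-of-open-writers approach), not Spark's actual DataSource V2 API: the writer computes the bucket id itself with the same function both paths share, and keeps one open writer per bucket id so rows may arrive unsorted. All names here (`BucketWriterSketch`, `bucketId`, `NUM_BUCKETS`) are illustrative assumptions, and the "files" are plain in-memory buffers.

```java
import java.util.HashMap;
import java.util.Map;

public class BucketWriterSketch {
    static final int NUM_BUCKETS = 4;

    // Stand-in for the shared v2 bucket function. The real point of the
    // discussion above is that read and write paths must use the exact same
    // hash, whatever engine computes it.
    static int bucketId(int key) {
        return Math.floorMod(Integer.hashCode(key), NUM_BUCKETS);
    }

    // One open "file" (here just a StringBuilder) per bucket id, so the
    // writer accepts rows in any order without a preceding local sort.
    private final Map<Integer, StringBuilder> open = new HashMap<>();

    void write(int key, String value) {
        open.computeIfAbsent(bucketId(key), b -> new StringBuilder())
            .append(key).append('=').append(value).append('\n');
    }

    // Closing flushes every open bucket; returns bucket id -> file contents.
    Map<Integer, String> close() {
        Map<Integer, String> files = new HashMap<>();
        open.forEach((b, buf) -> files.put(b, buf.toString()));
        return files;
    }

    public static void main(String[] args) {
        BucketWriterSketch w = new BucketWriterSketch();
        w.write(1, "a");
        w.write(5, "b"); // keys 1 and 5 land in the same bucket under floorMod(_, 4)
        w.write(2, "c");
        Map<Integer, String> files = w.close();
        System.out.println(files.size()); // prints 2: only two buckets were opened
    }
}
```

The trade-off this sketch illustrates: with the map, memory grows with the number of distinct buckets a task sees, which is exactly why the comment notes that the extra clustering (fewer buckets per task) only serves to reduce the number of files written out.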
