cloud-fan commented on PR #44133: URL: https://github.com/apache/spark/pull/44133#issuecomment-1895201489
The design of bucketed tables is that the data is pre-partitioned by some keys. If the join/grouping keys happen to be the bucket keys, we can save the shuffle. What if the join/grouping keys are the bucket keys wrapped in a cast? I think the cast is fine as long as it doesn't break the data partitioning, i.e. the same values after the cast still land in the same partition. This holds for an upcast (the value doesn't change, only the type does), but not for a downcast (`a.intCol = try_cast(b.bigIntCol AS int)`): after the downcast, values from different partitions can all become null, and those nulls should end up in the same partition for the join/grouping.
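For concreteness, here is a minimal sketch of the two query shapes being discussed. The column names `intCol`/`bigIntCol` come from the example above; the table names, bucket count, and Scala scaffolding are made up for illustration. It does not claim Spark elides the shuffle in the upcast case (that is what this PR is about); it just sets up the tables so you can compare the `explain()` output of the two joins.

```scala
import org.apache.spark.sql.SparkSession

object BucketedCastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-cast-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Table a is bucketed by an INT column, table b by a BIGINT column.
    Seq(1, 2, 3).toDF("intCol")
      .write.bucketBy(8, "intCol").saveAsTable("a")
    Seq(1L, 2L, 3L).toDF("bigIntCol")
      .write.bucketBy(8, "bigIntCol").saveAsTable("b")

    // Upcast on the join key: every INT value maps to the same BIGINT value,
    // so the cast does not move rows across logical partitions. This is the
    // case where skipping the shuffle could be safe.
    spark.sql(
      "SELECT * FROM a JOIN b ON CAST(a.intCol AS BIGINT) = b.bigIntCol"
    ).explain()

    // Downcast on the join key: distinct BIGINT values from different buckets
    // can all collapse to NULL under try_cast, which breaks the partitioning
    // assumption, so the shuffle cannot be skipped safely.
    spark.sql(
      "SELECT * FROM a JOIN b ON a.intCol = TRY_CAST(b.bigIntCol AS INT)"
    ).explain()

    spark.stop()
  }
}
```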
