wangyum commented on PR #44133:
URL: https://github.com/apache/spark/pull/44133#issuecomment-1906223108

   > The design of bucketed tables is that the data is pre-partitioned by some
   > keys. If the join/grouping keys happen to be the bucketed keys, we can save
   > the shuffle. What if the join/grouping keys are bucketed keys with a cast?
   > I think the cast is fine if we don't break the data partitioning, which
   > means the same data after the cast is still in the same partition. This
   > will be true for an upcast (the value doesn't change, only the type
   > changes), but not a downcast (`a.intCol = try_cast(b.bigIntCol AS int)`).
   > After a downcast, data from different partitions can result in null, which
   > should be in the same partition for join/grouping.
   
   This is incorrect. Please see: 
https://github.com/apache/spark/blob/69aa727ff495f6698fe9b37e952dfaf36f1dd5eb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L507-L508
   
   Different types use different hash methods, which means the same value can
land in different buckets depending on its type. For example:
   ```scala
   scala> import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal, Murmur3Hash, Pmod}
   import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal, Murmur3Hash, Pmod}

   scala> import org.apache.spark.sql.types.LongType
   import org.apache.spark.sql.types.LongType

   scala> println(Pmod(new Murmur3Hash(Seq(Literal(100))), Literal(10)).eval(EmptyRow))
   3

   scala> println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100), LongType))), Literal(10)).eval(EmptyRow))
   0
   ```
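
   As a follow-up sketch (not from the original comment), hashing the same
value as a native long literal should land in the same bucket as the cast
version above, which confirms that the runtime type, not the source column
type, determines the bucket. In the same REPL session:

   ```scala
   // Literal(100L) has LongType, the same type and value the Cast produces,
   // so it hashes identically to the cast example above.
   scala> println(Pmod(new Murmur3Hash(Seq(Literal(100L))), Literal(10)).eval(EmptyRow))
   0
   ```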


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

