WweiL commented on PR #45971:
URL: https://github.com/apache/spark/pull/45971#issuecomment-2052419964

   > the seed might behave differently across runs/on different machines
   
   Ah I see, this indeed makes sense. 
   
   In this case, I think we should fix the row generator; it's okay to 
sacrifice the randomness of the rows here.
   We can have a dedicated row-generation function that, depending on the input 
type, simply returns a fixed value (e.g. if the input is an int, just return 
233; if it is a long, just return 0xdeadbeef).
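   To make the idea concrete, here is a minimal sketch (plain Java with hypothetical names, not Spark's actual test generators) of such a fixed-value row generator — the point is only that the output depends on the schema alone, so it is identical across runs and machines:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class FixedRowGen {
       enum FieldType { INT, LONG, BYTE, STRING }

       // Hypothetical helper: one fixed value per supported field type.
       static Object fixedValueFor(FieldType t) {
           switch (t) {
               case INT:    return 233;
               case LONG:   return 0xdeadbeefL;
               case BYTE:   return (byte) 0x7f;
               case STRING: return "fixed";
               default:     throw new IllegalArgumentException("unsupported: " + t);
           }
       }

       // A "row" for a schema is just the fixed value for each of its fields.
       static List<Object> generateRow(List<FieldType> schema) {
           List<Object> row = new ArrayList<>();
           for (FieldType t : schema) row.add(fixedValueFor(t));
           return row;
       }
   }
   ```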
   
   Giving up randomness of the rows should still get the job done. The hash is 
computed as something like `hash(field 1, hash(field 2, seed...)...)`, and this 
part has likely not been touched since the beginning.
   
https://github.com/apache/spark/blob/6ee662c28ffb0deb70f08a971f9c1869288d39ba/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L289-L298
   
   This should not be changed, and even if it were, it would make little sense 
to change it to a hash function that hashes (1, 2) and (2, 1) to the same 
bucket.
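   For illustration, here is a toy version of that order-dependent fold — the mixer below is a simple polynomial stand-in, not Spark's actual Murmur3; the only point is that folding the fields through the seed makes (1, 2) and (2, 1) hash differently:

   ```java
   import java.util.List;

   public class ChainedHash {
       // Stand-in mixer (Spark uses a Murmur3-based hash per field instead).
       static int mix(int value, int seed) {
           return 31 * seed + value;
       }

       // seed_{i+1} = mix(field_i, seed_i): each field is folded into the running seed.
       static int hashRow(List<Integer> fields, int seed) {
           int acc = seed;
           for (int f : fields) acc = mix(f, acc);
           return acc;
       }
   }
   ```

   Because the running seed carries the order, swapping two fields changes the result, which is the property the test relies on.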
   
   Then, as long as we increase the maximum number of fields and the number of 
schemas (this can now be a fairly high number, though keep an eye on the test 
run time), it should behave similarly to having a large number of randomly 
generated rows.
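   The coverage-by-enumeration idea could be sketched like this (hypothetical helper, plain Java): enumerating every schema up to a maximum field count yields types^fields combinations per size, which is both why coverage gets high quickly and why the run time needs watching:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class SchemaEnum {
       static final String[] FIELD_TYPES = {"int", "long", "byte", "string"};

       // All schemas with exactly n fields (cartesian product of the type choices).
       static List<List<String>> schemasOfSize(int n) {
           List<List<String>> acc = new ArrayList<>();
           acc.add(new ArrayList<>());
           for (int i = 0; i < n; i++) {
               List<List<String>> next = new ArrayList<>();
               for (List<String> s : acc) {
                   for (String t : FIELD_TYPES) {
                       List<String> extended = new ArrayList<>(s);
                       extended.add(t);
                       next.add(extended);
                   }
               }
               acc = next;
           }
           return acc;
       }

       // All schemas with 1..maxFields fields; with 4 types this is 4 + 16 + 64 + ...
       static List<List<String>> allSchemas(int maxFields) {
           List<List<String>> all = new ArrayList<>();
           for (int n = 1; n <= maxFields; n++) all.addAll(schemasOfSize(n));
           return all;
       }
   }
   ```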


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
