mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-696786519


   > While I see the caller of hashUnsafeWords(), I am not sure what hash value is expected.
   
   It's a good question. Right now the only users of `hashUnsafeWords` appear 
to be `UnsafeRow` and `BytesToBytesMap`. `UnsafeRow` uses 8-byte slots to store 
all primitive values. `BytesToBytesMap` appears to enforce that keys are 8-byte 
aligned. In both cases I am fairly certain they call `hashUnsafeWords` 
rather than `hashUnsafeBytes` simply as an optimization, since they know the 
input is 8-byte aligned.
   
   For `UnsafeRow` in particular, sub-8-byte types do not appear to be extended 
to 8-byte types; they are just left-aligned in the same slot. For example, a 
`float` would be left-aligned in the slot rather than converted to a `double` 
to fill the slot. A word-by-word hash of them would therefore still produce 
different results on big- and little-endian systems.
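
   A minimal sketch of the effect described above, not Spark code: the class 
and method names are hypothetical, and a `ByteBuffer` with an explicit byte 
order stands in for native memory access on each kind of machine. It writes a 
`float` at the start of a zeroed 8-byte slot and reads the whole slot back as 
one 8-byte word.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SlotEndianness {
    // Hypothetical helper: stores a float at the start of a zeroed 8-byte slot
    // using the given byte order, then reads the whole slot back as a long in
    // the same order, which is what a word-by-word hash would consume.
    static long slotAsWord(float value, ByteOrder order) {
        ByteBuffer slot = ByteBuffer.allocate(8).order(order);
        slot.putFloat(0, value);
        return slot.getLong(0);
    }

    public static void main(String[] args) {
        long le = slotAsWord(1.0f, ByteOrder.LITTLE_ENDIAN);
        long be = slotAsWord(1.0f, ByteOrder.BIG_ENDIAN);
        // The float bits 0x3f800000 land in the low half of the word on a
        // little-endian layout and in the high half on a big-endian layout.
        System.out.printf("little-endian word: 0x%016x%n", le);
        System.out.printf("big-endian word:    0x%016x%n", be);
    }
}
```

   Because the two words differ, any hash that consumes the slot as a single 
native-order word would see different input on the two kinds of system.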
   
   My current understanding is therefore that `hashUnsafeWords` and 
`hashUnsafeBytes` should produce identical results for the same input.
   
   Aside: the use of 'word' here is, in my opinion, ambiguous. Is it 32 bits on 
a 32-bit system? I think that if we wanted to be able to hash an array of 8-byte 
values as if they were encoded in little-endian byte order, we should name the 
method after the Java type identifier, so something like `hashUnsafeLongs`.
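
   As a rough illustration of what "as if they were encoded in little-endian 
byte order" could mean for 8-byte values, here is a small sketch. The class and 
method names are hypothetical and are not part of Spark; it only shows the 
normalization step, not a hash function.

```java
import java.nio.ByteOrder;

public class LittleEndianLongs {
    // Hypothetical helper: given a long that was loaded from memory in
    // `loadedOrder`, returns the value the same 8 bytes would have if they
    // had been read in little-endian order.
    static long asLittleEndian(long loaded, ByteOrder loadedOrder) {
        return loadedOrder == ByteOrder.LITTLE_ENDIAN
            ? loaded
            : Long.reverseBytes(loaded);
    }

    public static void main(String[] args) {
        // On the running JVM, normalize a natively loaded value.
        long nativeValue = 0x0102030405060708L;
        long normalized = asLittleEndian(nativeValue, ByteOrder.nativeOrder());
        System.out.printf("normalized: 0x%016x%n", normalized);
    }
}
```

   A method along the lines of `hashUnsafeLongs` could apply such a step to 
each value before mixing it into the hash, so that the result would not depend 
on the platform's byte order.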




