mundaym commented on pull request #29762: URL: https://github.com/apache/spark/pull/29762#issuecomment-696786519
> While I see the caller of hashUnsafeWords(), I am not sure what hash value is expected.

It's a good question. Right now the only users of `hashUnsafeWords` appear to be `UnsafeRow` and `BytesToBytesMap`. `UnsafeRow` uses 8-byte slots to store all primitive values. `BytesToBytesMap` appears to enforce that keys are 8-byte aligned. In both cases I am fairly certain they call `hashUnsafeWords` rather than `hashUnsafeBytes` simply as an optimization, since they know the input is 8-byte aligned.

For `UnsafeRow` in particular, sub-8-byte types do not appear to be extended to 8-byte types; they are just left-aligned in the same slot. For example, a `float` would be left-aligned in the slot rather than converted to a `double` to fill it. A word-by-word hash of such slots would therefore still produce different results on big- and little-endian systems. My current understanding is therefore that `hashUnsafeWords` and `hashUnsafeBytes` should produce identical results.

Aside: the use of "word" here is, in my opinion, ambiguous. Is it 32 bits on a 32-bit system? I think that if we wanted to be able to hash an array of 8-byte values as if they were encoded in little-endian byte order, we should name the method after the Java type identifier, so something like `hashUnsafeLongs`.
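
To illustrate the `float` example, here is a minimal sketch. It is not Spark code: it uses `java.nio.ByteBuffer` with an explicit byte order to stand in for native memory access on each platform, and the class and method names (`SlotEndianness`, `floatSlotAsLong`, `doubleSlotAsLong`) are hypothetical. It assumes the slot is zero-filled and that the value is written at the start of the slot in the platform's byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SlotEndianness {
    // Write a float at the start of a zeroed 8-byte slot, then read the whole
    // slot back as a single long. The same byte order is used for both steps,
    // standing in for the native order of the platform.
    static long floatSlotAsLong(float value, ByteOrder order) {
        ByteBuffer slot = ByteBuffer.allocate(8).order(order);
        slot.putFloat(0, value);
        return slot.getLong(0);
    }

    // Same as above, but the value fills the whole slot.
    static long doubleSlotAsLong(double value, ByteOrder order) {
        ByteBuffer slot = ByteBuffer.allocate(8).order(order);
        slot.putDouble(0, value);
        return slot.getLong(0);
    }

    public static void main(String[] args) {
        // A float occupies only half the slot, so the 8-byte word value
        // depends on which half it lands in.
        System.out.printf("float  LE: 0x%016x%n",
                floatSlotAsLong(1.0f, ByteOrder.LITTLE_ENDIAN));
        System.out.printf("float  BE: 0x%016x%n",
                floatSlotAsLong(1.0f, ByteOrder.BIG_ENDIAN));

        // A double fills the slot, so the word value is the same either way.
        System.out.printf("double LE: 0x%016x%n",
                doubleSlotAsLong(1.0, ByteOrder.LITTLE_ENDIAN));
        System.out.printf("double BE: 0x%016x%n",
                doubleSlotAsLong(1.0, ByteOrder.BIG_ENDIAN));
    }
}
```

Under these assumptions the two `float` lines differ while the two `double` lines match, which is why hashing each slot by its 8-byte value would not by itself give platform-independent results for sub-8-byte types.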
