felipecrv commented on code in PR #35814:
URL: https://github.com/apache/arrow/pull/35814#discussion_r1210331733


##########
cpp/src/arrow/util/hashing.h:
##########
@@ -63,6 +63,27 @@ typedef uint64_t hash_t;
 template <uint64_t AlgNum>
 inline hash_t ComputeStringHash(const void* data, int64_t length);
 
+/// \brief A hash function for bitmaps that can handle offsets and lengths in
+/// terms of number of bits. The hash only depends on the bits actually hashed.
+///
+/// The key (a bitmap) is read as a sequence of 64-bit words before the 
trailing
+/// bytes. So if the input is 64-bit aligned, all memory accesses are aligned.
+///
+/// It's the caller's responsibility to ensure that bits_offset + num_bits are
+/// readable from the bitmap.
+///
+/// \pre bits_offset >= 0
+/// \pre num_bits >= 0
+/// \pre (bits_offset + num_bits + 7) / 8 <= length
+///
+/// \param bitmap The pointer to the bitmap.
+/// \param length The length of the bitmap in bytes.
+/// \param seed The seed for the hash function (useful when chaining hash 
functions).
+/// \param bits_offset The offset in bits relative to the start of the bitmap.
+/// \param num_bits The number of bits after the offset to be hashed.
+ARROW_EXPORT hash_t ComputeBitmapHash(const uint8_t* bitmap, int64_t length, 
hash_t seed,

Review Comment:
   The `length` in bytes allows me to `DCHECK` if the it's safe to read that 
number of bits within that range of bytes. I can remove it if you don't think 
this check is worth it.
   
   > Also, no need to pass a seed since we can easily combine separate hash 
values.
   
   This way of combining hashes is part of the MurmurHash design and it is 
inspired by how block ciphers are chained to guarantee that a single bit change 
early in the input has a huge effect on the output (Avalanche Effect).
   
   The xor we are using to combine hashes in our hash functions is modifying 
only the LSB bits (biasing them), probably making the hash output worse as a 
key for hash tables.
   
   ```
          Type::Hash: 0xfc0a5d5d3a4a5d05
             StdHash: 0xfc0a5d5d3a4a5d04 (from 1)
             StdHash: 0xfc0a5d5d3a4a5d06 (from 2)
   ComputeBitmapHash: 0xacf1b1adcf3c8bc5
             StdHash: 0xacf1b1adcf3c8bc5 (from 0)
             StdHash: 0xacf1b1adcf3c8bc7 (from 2)
           ArrayHash: 0xacf1b1adcf3c8bc7
           ArrayHash: 0xacf1b1adcf3c8bc7
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to