k8ika0s commented on PR #48211:
URL: https://github.com/apache/arrow/pull/48211#issuecomment-3568400422

   @Vishwanatha-HD
   
   Bloom filters are one of those parts of Parquet where tiny byte-order 
details end up mattering way more than you’d expect, so it’s good to see 
attention landing here.
   
   Something I ran into on s390x is that the xxhash input/output tends to stay 
a lot more predictable if the bitset words are kept in a single canonical order 
(LE in our case) and the reader/writer treat them as such. In my own 
experiments I normalized the bitset once at the boundary and let the rest of 
the logic operate on native values.
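
   As a rough illustration of normalizing once at the boundary, here is a 
minimal sketch. `LoadLE32`, `StoreLE32`, `DecodeBitset` and `EncodeBitset` are 
hypothetical names made up for this example, not Arrow APIs and not code from 
this PR. It assumes the serialized bitset is a sequence of little-endian 
32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: read one little-endian 32-bit word from raw bytes.
// Built from shifts, so it gives the same value on LE and BE hosts.
inline uint32_t LoadLE32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical helper: write one 32-bit word as little-endian bytes.
inline void StoreLE32(uint32_t v, uint8_t* p) {
  p[0] = static_cast<uint8_t>(v);
  p[1] = static_cast<uint8_t>(v >> 8);
  p[2] = static_cast<uint8_t>(v >> 16);
  p[3] = static_cast<uint8_t>(v >> 24);
}

// Reader boundary: convert the serialized (LE) bitset to native words once.
std::vector<uint32_t> DecodeBitset(const uint8_t* bytes, std::size_t num_bytes) {
  assert(num_bytes % 4 == 0);
  std::vector<uint32_t> words(num_bytes / 4);
  for (std::size_t i = 0; i < words.size(); ++i) {
    words[i] = LoadLE32(bytes + 4 * i);
  }
  return words;
}

// Writer boundary: convert native words back to the serialized (LE) form once.
std::vector<uint8_t> EncodeBitset(const std::vector<uint32_t>& words) {
  std::vector<uint8_t> bytes(words.size() * 4);
  for (std::size_t i = 0; i < words.size(); ++i) {
    StoreLE32(words[i], bytes.data() + 4 * i);
  }
  return bytes;
}
```

   With this shape, the find/insert code only ever sees native `uint32_t` 
values, and the byte-order handling lives in two functions.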
   
   In this patch, the per-word `FromLittleEndian`/`ToLittleEndian` inside the 
find/insert loops definitely keeps things correct, though it does couple the 
hashing logic to the byte-swapping a bit more tightly, and on BE hosts it adds 
a swap per word on every probe. I only mention it because that can sometimes 
show up in profiling when bloom filters are exercised heavily over wide row groups.
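
   For comparison, a minimal sketch of the two probe shapes being contrasted. 
All names here (`FindPerWord`, `InsertPerWord`, `FindNative`, `LoadLE32`, 
`StoreLE32`) are hypothetical and the masks are passed in rather than derived 
from a hash, so this is not the PR's code. It only assumes a block of eight 
32-bit words stored little-endian.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWordsPerBlock = 8;  // one block: 8 x 32-bit words (256 bits)

// Hypothetical helpers: endian-independent LE load/store of a 32-bit word.
inline uint32_t LoadLE32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

inline void StoreLE32(uint32_t v, uint8_t* p) {
  p[0] = static_cast<uint8_t>(v);
  p[1] = static_cast<uint8_t>(v >> 8);
  p[2] = static_cast<uint8_t>(v >> 16);
  p[3] = static_cast<uint8_t>(v >> 24);
}

// Variant A: the block stays in serialized LE form and every word is
// converted inside the loop, on every probe.
bool FindPerWord(const uint8_t* block, const uint32_t* mask) {
  assert(block != nullptr && mask != nullptr);
  for (int i = 0; i < kWordsPerBlock; ++i) {
    if ((LoadLE32(block + 4 * i) & mask[i]) == 0) return false;
  }
  return true;
}

void InsertPerWord(uint8_t* block, const uint32_t* mask) {
  assert(block != nullptr && mask != nullptr);
  for (int i = 0; i < kWordsPerBlock; ++i) {
    StoreLE32(LoadLE32(block + 4 * i) | mask[i], block + 4 * i);
  }
}

// Variant B: the block was already decoded to native words at the boundary,
// so the loop contains no byte-order handling at all.
bool FindNative(const uint32_t* words, const uint32_t* mask) {
  assert(words != nullptr && mask != nullptr);
  for (int i = 0; i < kWordsPerBlock; ++i) {
    if ((words[i] & mask[i]) == 0) return false;
  }
  return true;
}
```

   Both variants answer the same question for the same data. The difference 
is only where the conversion cost is paid: per probe in A, once per filter in B.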
   
   Not calling this out as a problem: the behavior you’re targeting here lines 
up with what I’ve seen on s390x, especially around making sure the mask checks 
behave the same across BE/LE hosts. Just sharing observations in case they’re 
useful while these pieces get polished.
   

