k8ika0s commented on PR #48211: URL: https://github.com/apache/arrow/pull/48211#issuecomment-3568400422
@Vishwanatha-HD Bloom filters are one of those parts of Parquet where tiny byte-order details end up mattering far more than you’d expect, so it’s good to see attention landing here.

Something I ran into on s390x is that the xxhash input/output stays a lot more predictable if the bitset words are kept in a single canonical order (LE in our case) and both the reader and the writer treat them as such. In my own experiments I normalized the bitset once at the boundary and let the rest of the logic operate on native values.

In this patch, the per-word `FromLittleEndian`/`ToLittleEndian` calls inside the find/insert loops do keep things correct, though they couple the hashing logic to the byte-swapping a bit more tightly. I only mention it because the extra swaps can sometimes show up in profiling when bloom filters are exercised heavily over wide row groups.

Not calling this out as a problem: the behavior you’re targeting here lines up with what I’ve seen on s390x, especially around making sure the mask checks behave the same across BE and LE hosts. Just sharing observations in case they’re useful while these pieces get polished.
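
For reference, the "normalize once at the boundary" idea could look roughly like the sketch below. This is not the code from this PR or from my experiments; `LoadLE32`, `StoreLE32`, `DecodeBitset`, `EncodeBitset`, `Insert` and `Find` are hypothetical names, and the block layout (8 x 32-bit words per block, salts, block index) follows the split-block scheme as I read it in the Parquet bloom filter spec. The serialized bitset is decoded from little-endian bytes into native words once, all probing happens on native values, and the words are re-encoded once on write.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers (not Arrow APIs). Composing the value from individual
// bytes gives the same result on little- and big-endian hosts.
inline uint32_t LoadLE32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

inline void StoreLE32(uint32_t v, uint8_t* p) {
  p[0] = static_cast<uint8_t>(v);
  p[1] = static_cast<uint8_t>(v >> 8);
  p[2] = static_cast<uint8_t>(v >> 16);
  p[3] = static_cast<uint8_t>(v >> 24);
}

// Read boundary: serialized little-endian bytes -> native words, done once.
inline std::vector<uint32_t> DecodeBitset(const std::vector<uint8_t>& bytes) {
  std::vector<uint32_t> words(bytes.size() / 4);
  for (std::size_t i = 0; i < words.size(); ++i) {
    words[i] = LoadLE32(bytes.data() + 4 * i);
  }
  return words;
}

// Write boundary: native words -> serialized little-endian bytes, done once.
inline std::vector<uint8_t> EncodeBitset(const std::vector<uint32_t>& words) {
  std::vector<uint8_t> bytes(words.size() * 4);
  for (std::size_t i = 0; i < words.size(); ++i) {
    StoreLE32(words[i], bytes.data() + 4 * i);
  }
  return bytes;
}

constexpr int kWordsPerBlock = 8;

// Salt values as given in the Parquet split-block bloom filter description.
constexpr uint32_t kSalt[kWordsPerBlock] = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Upper 32 bits of the hash select the block.
inline std::size_t BlockIndex(uint64_t hash, std::size_t num_blocks) {
  return static_cast<std::size_t>(((hash >> 32) * num_blocks) >> 32);
}

// Lower 32 bits of the hash select one bit in each word of the block.
inline uint32_t MaskWord(uint64_t hash, int i) {
  const uint32_t key = static_cast<uint32_t>(hash);
  return uint32_t{1} << ((key * kSalt[i]) >> 27);
}

// Insert and Find touch only native words; no byte swapping in the loops.
inline void Insert(std::vector<uint32_t>& words, uint64_t hash) {
  const std::size_t base =
      BlockIndex(hash, words.size() / kWordsPerBlock) * kWordsPerBlock;
  for (int i = 0; i < kWordsPerBlock; ++i) {
    words[base + i] |= MaskWord(hash, i);
  }
}

inline bool Find(const std::vector<uint32_t>& words, uint64_t hash) {
  const std::size_t base =
      BlockIndex(hash, words.size() / kWordsPerBlock) * kWordsPerBlock;
  for (int i = 0; i < kWordsPerBlock; ++i) {
    if ((words[base + i] & MaskWord(hash, i)) == 0) return false;
  }
  return true;
}
```

The trade-off is one pass over the bitset at read and write time in exchange for swap-free inner loops. Whether that is a net win depends on how many probes each filter sees, so I would treat it as something to measure rather than a given.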
