AntoinePrv commented on PR #47573: URL: https://github.com/apache/arrow/pull/47573#issuecomment-3406552924
One hypothesis I'm wondering about is whether mixing scalar code with SIMD introduces additional latency. For comparison, the Lemire implementation I'm curious about loads the data once with a `load_unaligned`, then applies a swizzle (byte reorder), then a right shift, then a mask. For small bit widths, we could even do multiple shifts per swizzle and multiple swizzles per read (the extreme case being bit_width=1, where we can read ~256 bits once and write 256 unpacked values).
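To make the load / swizzle / shift / mask sequence concrete, here is a rough scalar model of it. This is only a sketch under assumptions, not the code from the PR or from Lemire's implementation: the names (`LoadUnaligned`, `Swizzle`, `ShiftAndMask`, `Unpack4`) are hypothetical, the "register" is modeled as 16 bytes holding four 32-bit lanes, values are assumed to be packed least-significant-bit first, and `bit_width` is limited to 1..25 so that each value fits in the four bytes gathered for its lane. A real kernel would replace each loop with a single SIMD instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of a 128-bit register with four 32-bit lanes.
constexpr int kLanes = 4;
using Bytes16 = std::array<std::uint8_t, 16>;
using Lanes = std::array<std::uint32_t, kLanes>;

// Step 1: one unaligned load of 16 bytes (src must have 16 readable bytes).
inline Bytes16 LoadUnaligned(const std::uint8_t* src) {
  Bytes16 reg;
  std::memcpy(reg.data(), src, reg.size());
  return reg;
}

// Step 2: swizzle (byte reorder). Lane i receives the four bytes starting at
// the byte that contains the first bit of value i (assumes LSB-first packing).
inline Lanes Swizzle(const Bytes16& reg, int bit_width) {
  Lanes out{};
  for (int lane = 0; lane < kLanes; ++lane) {
    const std::size_t first_byte = static_cast<std::size_t>(lane * bit_width) / 8;
    std::uint32_t word = 0;
    for (std::size_t b = 0; b < 4; ++b) {
      word |= static_cast<std::uint32_t>(reg[first_byte + b]) << (8 * b);
    }
    out[static_cast<std::size_t>(lane)] = word;
  }
  return out;
}

// Steps 3 and 4: per-lane right shift to drop the leading bits that belong to
// the previous value, then a mask to keep only bit_width bits.
inline Lanes ShiftAndMask(Lanes lanes, int bit_width) {
  const std::uint32_t mask = (1u << bit_width) - 1u;
  for (int lane = 0; lane < kLanes; ++lane) {
    const int shift = (lane * bit_width) % 8;
    auto& v = lanes[static_cast<std::size_t>(lane)];
    v = (v >> shift) & mask;
  }
  return lanes;
}

// Unpacks the first four bit-packed values found at src.
inline Lanes Unpack4(const std::uint8_t* src, int bit_width) {
  assert(bit_width >= 1 && bit_width <= 25);  // limit of this sketch
  return ShiftAndMask(Swizzle(LoadUnaligned(src), bit_width), bit_width);
}
```

For example, with `bit_width = 3` and a 16-byte buffer starting with `0xDD, 0x03` (the values 5, 3, 7, 1 packed LSB first), `Unpack4` yields the lanes 5, 3, 7, 1. The "multiple shifts per swizzle" idea would correspond to calling `ShiftAndMask` several times with different shift offsets on the same swizzled lanes.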
