WillAyd commented on PR #326:
URL: https://github.com/apache/arrow-nanoarrow/pull/326#issuecomment-2061681249
Hey @mapleFU - that's great. I didn't read through everything you posted in
that issue, but the research is impressive, and certainly beyond what I was
able to accomplish here.
If it helps, I noticed in #280 that there was a significant performance
difference on x86 from avoiding shifts when packing bits, i.e. code like:
```
static inline void PackInt8Shifts(const int8_t* values, volatile uint8_t* out) {
  *out = (values[0] | values[1] << 1 | values[2] << 2 | values[3] << 3 |
          values[4] << 4 | values[5] << 5 | values[6] << 6 | values[7] << 7);
}
```
ran more than 10x slower when used in a larger Python process than the more
verbose:
```
static inline void PackInt8NoShifts(const int8_t* values, volatile uint8_t* out) {
  /* For 0/1 inputs, (v + (2^n - 1)) & 2^n yields 2^n when v is 1 and 0 when
     v is 0, so each term sets its target bit without a shift. */
  *out = (values[0] | ((values[1] + 0x1) & 0x2) | ((values[2] + 0x3) & 0x4) |
          ((values[3] + 0x7) & 0x8) | ((values[4] + 0xf) & 0x10) |
          ((values[5] + 0x1f) & 0x20) | ((values[6] + 0x3f) & 0x40) |
          ((values[7] + 0x7f) & 0x80));
}
```
Unfortunately that performance boost only seemed to apply to packing, and
only on x86. Joris and Dewey were not able to replicate the speedup on other
architectures, though it was more or less a moot point for them.
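In case it's easier to poke at locally, something along these lines should be
enough to compare the two variants standalone (just a sketch, not the actual
benchmark from #280; the buffer size, the random 0/1 fill, and the
clock()-based timing are all arbitrary choices):
```
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The two packers from above, copied here so the file compiles standalone. */
static inline void PackInt8Shifts(const int8_t* values, volatile uint8_t* out) {
  *out = (values[0] | values[1] << 1 | values[2] << 2 | values[3] << 3 |
          values[4] << 4 | values[5] << 5 | values[6] << 6 | values[7] << 7);
}

static inline void PackInt8NoShifts(const int8_t* values, volatile uint8_t* out) {
  *out = (values[0] | ((values[1] + 0x1) & 0x2) | ((values[2] + 0x3) & 0x4) |
          ((values[3] + 0x7) & 0x8) | ((values[4] + 0xf) & 0x10) |
          ((values[5] + 0x1f) & 0x20) | ((values[6] + 0x3f) & 0x40) |
          ((values[7] + 0x7f) & 0x80));
}

#define NVALUES (1 << 23) /* arbitrary size, must be a multiple of 8 */

int main(void) {
  int8_t* values = malloc(NVALUES);
  uint8_t* bitmap = malloc(NVALUES / 8);
  for (int64_t i = 0; i < NVALUES; i++) values[i] = (int8_t)(rand() & 1);

  /* Time the shift-based packer. The volatile out parameter keeps the
     compiler from optimizing the stores away. */
  clock_t start = clock();
  for (int64_t i = 0; i < NVALUES / 8; i++) {
    PackInt8Shifts(values + i * 8, bitmap + i);
  }
  printf("shifts:    %.3fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);

  /* Time the add-and-mask packer over the same input. */
  start = clock();
  for (int64_t i = 0; i < NVALUES / 8; i++) {
    PackInt8NoShifts(values + i * 8, bitmap + i);
  }
  printf("no shifts: %.3fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);

  free(values);
  free(bitmap);
  return 0;
}
```
Keep in mind a microbenchmark like this may not reproduce the 10x gap, since
that showed up inside a larger Python process rather than in isolation.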
I don't think the SO post I created was ever really answered, but you may find
some value in the comments provided there, particularly those from user Peter
Cordes:
https://stackoverflow.com/questions/77550709/x86-performance-difference-between-shift-and-add-when-packing-bits
Hope that helps