etseidl opened a new issue, #37939: URL: https://github.com/apache/arrow/issues/37939
### Describe the enhancement requested

The current implementation of `DeltaBitPackEncoder` uses unsigned arithmetic to handle possible overflow when calculating deltas (see [here](https://github.com/apache/arrow/blob/e9730f5971480b942c7394846162c4dfa9145aa9/cpp/src/parquet/encoding.cc#L2216)). This has unfortunate consequences when encoding small negative deltas. As an example, writing a vector with values `{1, 0, -1, 0, 1, 0, -1, 0, 1}` produces the following output (starting at the delta binary header):

```
00000030: 8001 0409 0202 ................
00000040: 2000 0000 feff ffff feff ffff 0000 0000 ...............
00000050: 0000 0000 feff ffff feff ffff 0000 0000 ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000c0: 0000 0000
```

The encoder uses a bit width of 32 for all values. If signed values are used instead, then the result is:

```
00000030: 8001 0409 0201 0200 ................
00000040: 0000 a0a0 0000 0000 0000
```

Here the encoder can use 2 bits per value. This can result in much smaller files, especially in cases where the logical type is less than 32 bits.

### Component(s)

C++
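The difference in bit width can be reproduced with a small standalone sketch. This is not the Arrow encoder; the function names below (`WrappingDeltas`, `BitWidthUnsignedMin`, `BitWidthSignedMin`) are hypothetical, and the sketch assumes a single block where the only difference is whether the minimum delta is chosen by unsigned or signed comparison. In both variants the subtraction itself still wraps in unsigned arithmetic, so overflow stays well defined.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Number of bits needed to represent v (0 when v == 0).
inline int BitsRequired(uint32_t v) {
  int bits = 0;
  while (v != 0) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

// Deltas between consecutive values, computed with wrapping unsigned
// subtraction so that overflow is well defined.
inline std::vector<uint32_t> WrappingDeltas(const std::vector<int32_t>& values) {
  std::vector<uint32_t> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(static_cast<uint32_t>(values[i]) -
                     static_cast<uint32_t>(values[i - 1]));
  }
  return deltas;
}

// Hypothetical variant: the minimum delta is chosen by UNSIGNED comparison,
// so a delta of -1 (0xFFFFFFFF) looks like the largest value.
inline int BitWidthUnsignedMin(const std::vector<int32_t>& values) {
  const std::vector<uint32_t> deltas = WrappingDeltas(values);
  if (deltas.empty()) return 0;
  const uint32_t min_delta = *std::min_element(deltas.begin(), deltas.end());
  uint32_t max_adjusted = 0;
  for (uint32_t d : deltas) max_adjusted = std::max(max_adjusted, d - min_delta);
  return BitsRequired(max_adjusted);
}

// Hypothetical variant: the minimum delta is chosen by SIGNED comparison.
// The uint32_t -> int32_t cast is modular on mainstream compilers
// (and guaranteed to be so from C++20 on).
inline int BitWidthSignedMin(const std::vector<int32_t>& values) {
  const std::vector<uint32_t> deltas = WrappingDeltas(values);
  if (deltas.empty()) return 0;
  int32_t min_delta = std::numeric_limits<int32_t>::max();
  for (uint32_t d : deltas) min_delta = std::min(min_delta, static_cast<int32_t>(d));
  uint32_t max_adjusted = 0;
  for (uint32_t d : deltas) {
    max_adjusted = std::max(max_adjusted, d - static_cast<uint32_t>(min_delta));
  }
  return BitsRequired(max_adjusted);
}
```

For the vector `{1, 0, -1, 0, 1, 0, -1, 0, 1}`, the deltas alternate between -1 and 1. With the unsigned minimum (1), the adjusted value for -1 is `0xFFFFFFFE`, which needs 32 bits; with the signed minimum (-1), the adjusted values are 0 and 2, which need 2 bits. These match the bit widths visible in the two dumps.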
