etseidl opened a new issue, #37939:
URL: https://github.com/apache/arrow/issues/37939

   ### Describe the enhancement requested
   
   The current implementation of `DeltaBitPackEncoder` uses unsigned arithmetic 
to handle possible overflow when calculating deltas (see 
[encoding.cc#L2216](https://github.com/apache/arrow/blob/e9730f5971480b942c7394846162c4dfa9145aa9/cpp/src/parquet/encoding.cc#L2216)).
 This has an unfortunate consequence when encoding small negative deltas: a 
delta such as -1 wraps to a large unsigned value, so the adjusted deltas (each 
delta minus the block's minimum delta) need the full bit width. As an example, 
writing a vector with values `{1, 0, -1, 0, 1, 0, -1, 0, 1}` produces the 
following output (starting at the delta binary header):
   ```
   00000030:                          8001 0409 0202  ................
   00000040: 2000 0000 feff ffff feff ffff 0000 0000   ...............
   00000050: 0000 0000 feff ffff feff ffff 0000 0000  ................
   00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
   00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
   00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
   00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
   000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
   000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
   000000c0: 0000 0000
   ```
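
For readers less familiar with the layout, the leading bytes of the dump above can be decoded by hand. The following is a minimal sketch, assuming the header layout described in the Parquet `DELTA_BINARY_PACKED` specification (ULEB128 varints, zigzag for the signed fields, then one bit-width byte per miniblock). The names `DeltaHeader`, `ReadVarint`, `ZigZagDecode` and `ParseHeader` are hypothetical and are not Arrow APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct for illustration; not an Arrow type.
struct DeltaHeader {
  uint64_t block_size = 0;
  uint64_t miniblocks_per_block = 0;
  uint64_t total_values = 0;
  int64_t first_value = 0;
  int64_t min_delta = 0;            // min delta of the first block
  std::vector<uint8_t> bit_widths;  // one entry per miniblock
};

// Reads one unsigned LEB128 varint starting at buf[pos] and advances pos.
inline uint64_t ReadVarint(const std::vector<uint8_t>& buf, std::size_t& pos) {
  uint64_t result = 0;
  int shift = 0;
  while (pos < buf.size() && shift < 64) {
    const uint8_t byte = buf[pos++];
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;  // high bit clear: last byte of the varint
    shift += 7;
  }
  return result;
}

// Maps a zigzag-encoded value back to a signed integer (0, 1, 2, 3 -> 0, -1, 1, -2).
inline int64_t ZigZagDecode(uint64_t v) {
  return static_cast<int64_t>(v >> 1) ^ -static_cast<int64_t>(v & 1);
}

// Parses the page header plus the first block's min delta and bit widths.
inline DeltaHeader ParseHeader(const std::vector<uint8_t>& buf) {
  std::size_t pos = 0;
  DeltaHeader h;
  h.block_size = ReadVarint(buf, pos);
  h.miniblocks_per_block = ReadVarint(buf, pos);
  h.total_values = ReadVarint(buf, pos);
  h.first_value = ZigZagDecode(ReadVarint(buf, pos));
  h.min_delta = ZigZagDecode(ReadVarint(buf, pos));
  for (uint64_t i = 0; i < h.miniblocks_per_block && pos < buf.size(); ++i) {
    h.bit_widths.push_back(buf[pos++]);
  }
  return h;
}
```

Feeding it the bytes `80 01 04 09 02 02 20 00 00 00` from the dump gives a block size of 128, 4 miniblocks, 9 values, a first value of 1, a minimum delta of 1, and miniblock bit widths of 32, 0, 0, 0.
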
   The encoder uses a bit width of 32 for every value in the first miniblock. 
If the deltas are treated as signed values instead, then the result is:
   ```
   00000030:                     8001 0409 0201 0200  ................
   00000040: 0000 a0a0 0000 0000 0000
   ```
   Here the encoder needs only 2 bits per value, so the encoded data shrinks 
from 138 bytes to 18 bytes in this example. This can result in much smaller 
files, especially in cases where the logical type is narrower than 32 bits.
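
The gap between the two outputs can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: it handles 32-bit values, treats the whole input as one miniblock, and uses hypothetical helpers (`BitWidth`, `MiniblockBitWidth`) that are not the Arrow implementation. Deltas are always computed with wrapping unsigned subtraction; the only thing that changes is whether the minimum delta is picked by unsigned or by signed comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of bits needed to store v (0 for v == 0).
inline int BitWidth(uint32_t v) {
  int bits = 0;
  while (v != 0) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

// Hypothetical helper: bit width needed for the adjusted deltas
// (delta - min_delta) of `values`, treated as a single miniblock.
// signed_min selects how the minimum delta is chosen.
inline int MiniblockBitWidth(const std::vector<int32_t>& values, bool signed_min) {
  if (values.size() < 2) return 0;

  std::vector<uint32_t> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    // Unsigned subtraction wraps around instead of overflowing.
    deltas.push_back(static_cast<uint32_t>(values[i]) -
                     static_cast<uint32_t>(values[i - 1]));
  }

  // Compare either as raw unsigned values or reinterpreted as signed.
  // The cast assumes two's complement (guaranteed since C++20).
  auto less = [signed_min](uint32_t a, uint32_t b) {
    return signed_min ? static_cast<int32_t>(a) < static_cast<int32_t>(b) : a < b;
  };
  const uint32_t min_delta = *std::min_element(deltas.begin(), deltas.end(), less);

  uint32_t max_adjusted = 0;
  for (uint32_t d : deltas) {
    max_adjusted = std::max(max_adjusted, d - min_delta);
  }
  return BitWidth(max_adjusted);
}
```

For `{1, 0, -1, 0, 1, 0, -1, 0, 1}` the deltas are -1, -1, 1, 1, -1, -1, 1, 1. With an unsigned minimum (1) the adjusted value for -1 is `0xFFFFFFFE`, so `MiniblockBitWidth(values, false)` returns 32. With a signed minimum (-1) the adjusted deltas are 0, 0, 2, 2, ..., so `MiniblockBitWidth(values, true)` returns 2, matching the two dumps.
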
   
   ### Component(s)
   
   C++

