etseidl opened a new issue, #9793:
URL: https://github.com/apache/arrow-rs/issues/9793

   **Describe the bug**
   The `skip` function in `DeltaBitPackDecoder` utilizes a `skip_buffer` to 
decode values needed to keep `last_value` valid. This buffer has a fixed size 
of `32` for INT32 and `64` for INT64. While these are the sizes used by this 
crate when writing, there is nothing in the specification regarding sizes 
beyond the stipulation that block size must be a multiple of 128, and the 
number of elements in a miniblock is a multiple of 32. It is conceivable that a 
writer could be aggressive in its sizing when encoding long runs of the same 
value.
   
   Consider a column of all `0`. `first_value` in the delta header would be 
`0`, all deltas from that value would likewise be `0`, so each miniblock would 
have a bit_width of `0` as well. A "normal" encoder would emit multiple blocks 
of 128 values, with 4 32-element miniblocks, but only the headers would be 
written as there would be no bit-packed data. A clever writer could see this, 
and instead of wastefully writing many headers, it could so alter the initial 
parameters to write a single block with very large miniblocks.
   
   To write 20000 zeros, one could have:
   ```
   delta header:
     block_size = 20096
     num_mini_blocks = 4
     total_values = 20000
     first_value = 0
   
   block header:
     min_delta = 0
     bit_widths = [0, 0, 0, 0]
   
   end-of-page
   ```
   The number of values per miniblock is `5024`, which is divisible by 32, and 
the blocksize `20096` is likewise divisible by 128. Trying to create a slice of 
5024 elements from a vector of size 32 will panic.
   
   Now this particular scenario would not be an issue since the `skip_buffer` 
is no longer used when the miniblock bitwidth is `0`, but there is nothing at 
all stopping some future writer deciding it can more efficiently encode 
miniblocks of 128 elements say, and setting the block size to 512.
   
   To be safe, I think the result of `block_size / mini_blocks_per_block` 
should be used when sizing `skip_buffer`. The only downside is that this leaves 
open the possibility of exhausting memory.
   
   
   **To Reproduce**
   I'll try to produce an example file that breaks the current skip function.
   
   **Expected behavior**
   `skip()` should not panic
   
   **Additional context**
   This is a theoretical possibility. I know of no writers in widespread use 
that do anything like what is described here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to