sahuagin opened a new issue, #9783:
URL: https://github.com/apache/arrow-rs/issues/9783

   The initial idea came in 2017 while evaluating the Parquet specification for a project. In
   particular I got to the Delta Encoding description and noted that the format was
   `<min delta> <list of bitwidths of miniblocks> <miniblocks>`. Thinking about how a column
   traversal for a scan would occur, the block would be expanded into memory, scanned, and then
   tossed. However, there is already enough information in the header to determine if a value
   falls between the min and max of the miniblocks in this block. Rather than decompressing and
   comparing, why not compress the predicate and compare against the compressed data?
   
   I didn't have time to explore the idea at the time, but I continued to be curious whether it'd
   been implemented as part of the predicate pushdown work. Recently I explored this with Claude,
   giving it a research task to see if the idea had been implemented. Informed that it hadn't,
   we worked together on a test project to see if the idea had any merit. Satisfied that it worked,
   we then integrated it into the codebase. The API change (`Decoder::scan_filtered`) is isolated
   to the final PR so it can be targeted independently.
   
   This first PR addresses the simplest case: miniblocks with `bit_width=0` (bw=0) in the
   non-terminal skip path.
   
   ----
   
   In `DeltaBitPackDecoder::skip()`, miniblocks with `bit_width=0` currently call
   `get_batch` and iterate over 32 or 64 values even though every delta in the
   miniblock equals `min_delta` exactly (remainders are all zero). No bit reads are
   needed; the only state update is advancing `last_value` by `n * min_delta`.
   
   **Proposed fix:** When `bit_width == 0` in the non-terminal skip path, replace
   the `get_batch` call with a single `wrapping_mul` + `wrapping_add`. When
   `min_delta == 0` the update is a no-op and can be skipped entirely.
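
The arithmetic behind the proposed fix can be sketched as below. This is a minimal standalone illustration, not the arrow-rs code: the function names `skip_zero_width_miniblock` and `skip_by_iteration` are hypothetical, values are assumed to be `i64`, and `n` stands for the number of values skipped in the miniblock (32 or 64 in the description above). The second function only models the per-value loop for comparison.

```rust
/// Hypothetical helper (not part of arrow-rs): advance `last_value` past `n`
/// values of a miniblock whose bit width is 0. Every delta in such a miniblock
/// equals `min_delta`, so no bit reads are needed.
pub fn skip_zero_width_miniblock(last_value: i64, min_delta: i64, n: usize) -> i64 {
    if min_delta == 0 {
        // All skipped values are identical; there is nothing to update.
        return last_value;
    }
    // One multiply and one add replace `n` per-value additions. Wrapping
    // arithmetic gives the same result as `n` successive wrapping additions.
    last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
}

/// Reference model of the per-value approach, used only to check the shortcut.
pub fn skip_by_iteration(mut last_value: i64, min_delta: i64, n: usize) -> i64 {
    for _ in 0..n {
        last_value = last_value.wrapping_add(min_delta);
    }
    last_value
}
```

Because wrapping multiplication and addition are both arithmetic modulo 2^64, the shortcut should agree with the loop even when the running value overflows.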
   
   **Measured improvement (arrow_reader bench, vs upstream HEAD):**
   - bw=0 single-value skip: -21.6%
   - bw=0 increasing-value skip: -24.3%

