sahuagin opened a new issue, #9783: URL: https://github.com/apache/arrow-rs/issues/9783
The initial idea came in 2017 while I was evaluating the Parquet specification for a project. When I got to the Delta Encoding description, I noted that each block is laid out as `<min delta> <list of bitwidths of miniblocks> <miniblocks>`. Thinking about how a column traversal for a scan would occur, the block would be expanded into memory, scanned, and then tossed. However, there is already enough information in the header to determine whether a value falls between the min and max of the miniblocks in this block. Rather than decompressing and comparing, why not compress the predicate and compare against the compressed data?

I didn't have time to explore the idea back then, but I remained curious whether it had been implemented as part of the predicate pushdown work. Recently I explored this with Claude, giving it a research task to see if the idea had been implemented. When the answer was that it hadn't, we worked together on a test project to see if the idea had any merit. Satisfied that it worked, we then integrated it into the codebase.

The API change (`Decoder::scan_filtered`) is isolated to the final PR so it can be targeted independently. This first PR addresses the simplest case: bw=0 miniblocks in the non-terminal skip path.

----

In `DeltaBitPackDecoder::skip()`, miniblocks with `bit_width=0` currently call `get_batch` and iterate over 32 or 64 values even though every delta in the miniblock equals `min_delta` exactly (remainders are all zero). No bit reads are needed; the only state update is advancing `last_value` by `n * min_delta`.

**Proposed fix:** When `bit_width == 0` in the non-terminal skip path, replace the `get_batch` call with a single `wrapping_mul` + `wrapping_add`. When `min_delta == 0` the loop body is a no-op and can be skipped entirely.

**Measured improvement (arrow_reader bench, vs upstream HEAD):**
- bw=0 single-value skip: -21.6%
- bw=0 increasing-value skip: -24.3%

--
This is an automated message from the Apache Git Service.
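The arithmetic behind the proposed bw=0 skip can be sketched in isolation. The following is a minimal standalone illustration, not the arrow-rs implementation: the function names `skip_bw0_naive` and `skip_bw0_fast` are hypothetical, and the values are fixed to `i64` for simplicity, whereas the real decoder is generic over the physical type. It only shows that, when every delta equals `min_delta`, stepping `n` times and doing one `wrapping_mul` + `wrapping_add` land on the same `last_value`, including on overflow.

```rust
/// Hypothetical reference: advance `last_value` one delta at a time, as a
/// generic skip loop would after reading `n` zero-width remainders.
fn skip_bw0_naive(last_value: i64, min_delta: i64, n: usize) -> i64 {
    let mut value = last_value;
    for _ in 0..n {
        // With bit_width == 0 every remainder is 0, so each delta is min_delta.
        value = value.wrapping_add(min_delta);
    }
    value
}

/// Hypothetical closed form: one multiply and one add, no bit reads.
fn skip_bw0_fast(last_value: i64, min_delta: i64, n: usize) -> i64 {
    if min_delta == 0 {
        // Constant run: skipping any number of values leaves last_value as is.
        return last_value;
    }
    // Wrapping arithmetic is arithmetic modulo 2^64, so this matches the
    // result of n repeated wrapping additions.
    last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
}
```

For example, `skip_bw0_fast(10, 3, 32)` returns `106`, the same value the naive loop reaches after 32 additions. The early return for `min_delta == 0` mirrors the constant-run case mentioned in the proposed fix.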
