sahuagin opened a new issue, #9785:
URL: https://github.com/apache/arrow-rs/issues/9785

   Range predicates (e.g. `value >= X`) applied to DELTA_BINARY_PACKED columns
   currently require decoding every value before filtering. The DELTA format's
   per-miniblock headers (`min_delta`, `bit_width`) make it possible to compute 
a
   conservative `[lo, hi]` range for an entire miniblock without decoding
   individual values — if the predicate rejects the range, the miniblock can be
   skipped in O(1).
   
   **Proposed addition:** A new provided method 
`Decoder::scan_filtered(num_values,
   out, predicate)` on the `Decoder` trait. The default implementation ignores 
the
   predicate and decodes everything (safe fallback for all encodings that don't
   carry per-region metadata). `DeltaBitPackDecoder` overrides it to:
   
   1. Check `first_value` as a point range `[v, v]` before entering the 
miniblock loop.
   2. Per miniblock, compute a conservative `[lo, hi]` using `last_value`,
      `min_delta`, `bit_width`, and the count of values in the miniblock.
   3. If the predicate rejects `[lo, hi]`, skip the miniblock (O(1) for bw=0;
      decode-to-track-last_value for mid-stream bw>0 misses; bit-skip for 
terminal
      bw>0 misses).
   4. If the predicate accepts, decode and emit the miniblock normally.
   
   Returns `(values_emitted, values_consumed)`.
   
   **Measured improvement:**
   - Wall-time scan on 1M-row monotone DELTA column: 1.96ms → 470µs (4.2x)
   
   **Note:** `scan_filtered` is intentionally conservative — a range miss on a
   miniblock is always safe (false positives are possible, false negatives are 
not).
   Callers that need exact per-value filtering should apply a second pass on the
   emitted values.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to