etseidl commented on code in PR #9769:
URL: https://github.com/apache/arrow-rs/pull/9769#discussion_r3113974576
##########
parquet/src/encodings/decoding.rs:
##########
@@ -847,6 +882,30 @@ where
self.values_left -= 1;
}
+ // Terminal skip: caller is discarding all remaining values on this page.
+ // last_value will never be read again, so we can use O(1) arithmetic
+ // skips (BitReader::skip) instead of decoding through get_batch.
+ let terminal = to_skip >= self.values_left + skip;
Review Comment:
I think all of the "terminal" logic can move before taking `first_value`. If
`terminal` is true, we don't even need to take `first_value`.
##########
parquet/src/encodings/decoding.rs:
##########
@@ -862,55 +921,191 @@ where
let bit_width = self.mini_block_bit_widths[self.mini_block_idx] as usize;
self.check_bit_width(bit_width)?;
let mini_block_to_skip = self.mini_block_remaining.min(to_skip - skip);
- let mini_block_should_skip = mini_block_to_skip;
-
- let skip_count = self
- .bit_reader
- .get_batch(&mut skip_buffer[0..mini_block_to_skip], bit_width);
- if skip_count != mini_block_to_skip {
- return Err(general_err!(
- "Expected to skip {} values from mini block got {}.",
- mini_block_batch_size,
- skip_count
- ));
- }
-
- // see commentary in self.get() above regarding optimizations
let min_delta = self.min_delta.as_i64()?;
if bit_width == 0 {
- // if min_delta == 0, there's nothing to do. self.last_value is unchanged
+ // All remainders are zero: every delta equals min_delta exactly.
+ // Advance last_value by n * min_delta with no bit reads.
Review Comment:
The new comments here do not address the `min_delta == 0` case.
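
For reference, a minimal sketch of the `bit_width == 0` arithmetic with the `min_delta == 0` case called out explicitly. `advance_last_value` is a hypothetical helper, not a function in the PR, and the use of wrapping arithmetic is an assumption about how the decoder treats overflow:

```rust
/// Hypothetical helper (not part of the PR): advance `last_value` past `n`
/// values of a mini block whose bit width is 0, i.e. every delta equals
/// `min_delta` exactly.
fn advance_last_value(last_value: i64, min_delta: i64, n: usize) -> i64 {
    if min_delta == 0 {
        // If min_delta == 0 there is nothing to do: last_value is unchanged.
        return last_value;
    }
    // Otherwise advance by n * min_delta with no bit reads.
    // Wrapping arithmetic is an assumption made for this sketch.
    last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
}
```

A comment along the lines of the first branch would cover the case the removed line used to document.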
##########
parquet/src/encodings/decoding.rs:
##########
@@ -847,6 +882,30 @@ where
self.values_left -= 1;
}
+ // Terminal skip: caller is discarding all remaining values on this page.
+ // last_value will never be read again, so we can use O(1) arithmetic
+ // skips (BitReader::skip) instead of decoding through get_batch.
+ let terminal = to_skip >= self.values_left + skip;
+
+ if terminal {
+ while skip < to_skip {
Review Comment:
I think this can simply set `self.values_left` to 0, and perhaps take
`first_value` just in case. The only reason for stepping through the headers is
to do validation, but if we're skipping anyway, I think we can just ignore
invalid data.
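
A rough sketch of the shape this suggestion implies, on a toy type rather than the real decoder. `ToyDecoder` and its fields are hypothetical stand-ins, and the non-terminal path is reduced to bookkeeping only:

```rust
/// Hypothetical stand-in for the decoder state that `skip` touches.
struct ToyDecoder {
    values_left: usize,
    first_value: Option<i64>,
}

impl ToyDecoder {
    /// Skips up to `to_skip` values and returns how many were skipped.
    fn skip(&mut self, to_skip: usize) -> usize {
        // Terminal check happens before first_value is consumed.
        if to_skip >= self.values_left {
            let skipped = self.values_left;
            // Drop the pending first value "just in case".
            self.first_value.take();
            // No header walk and no validation: the rest of the page is discarded.
            self.values_left = 0;
            return skipped;
        }

        // Non-terminal path: here to_skip < values_left.
        let mut skipped = 0;
        if to_skip > 0 && self.first_value.take().is_some() {
            self.values_left -= 1;
            skipped += 1;
        }
        // The real decoder would step through mini blocks here; the sketch
        // only updates the counter.
        let rest = to_skip - skipped;
        self.values_left -= rest;
        skipped + rest
    }
}
```

The trade-off is the one noted above: the terminal branch no longer reads block headers, so malformed data in the skipped region goes unreported.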
##########
parquet/src/encodings/decoding.rs:
##########
@@ -862,55 +921,191 @@ where
let bit_width = self.mini_block_bit_widths[self.mini_block_idx] as usize;
self.check_bit_width(bit_width)?;
let mini_block_to_skip = self.mini_block_remaining.min(to_skip - skip);
- let mini_block_should_skip = mini_block_to_skip;
-
- let skip_count = self
- .bit_reader
- .get_batch(&mut skip_buffer[0..mini_block_to_skip], bit_width);
- if skip_count != mini_block_to_skip {
- return Err(general_err!(
- "Expected to skip {} values from mini block got {}.",
- mini_block_batch_size,
- skip_count
- ));
- }
-
- // see commentary in self.get() above regarding optimizations
Review Comment:
Not sure why this comment was dropped; please restore it.