jonded94 commented on PR #9374: URL: https://github.com/apache/arrow-rs/pull/9374#issuecomment-3891211393
Hey, unfortunately this is a bit outside my skill set/expertise, but here is a slightly different fix: [different_bugfix.patch](https://github.com/user-attachments/files/25264302/different_bugfix.patch)

```diff
diff --git a/parquet/src/column/page.rs b/parquet/src/column/page.rs
index f18b296c1..4cfc07a02 100644
--- a/parquet/src/column/page.rs
+++ b/parquet/src/column/page.rs
@@ -406,7 +406,14 @@ pub trait PageReader: Iterator<Item = Result<Page>> + Send {
     /// [(#4327)]: https://github.com/apache/arrow-rs/pull/4327
     /// [(#4943)]: https://github.com/apache/arrow-rs/pull/4943
     fn at_record_boundary(&mut self) -> Result<bool> {
-        Ok(self.peek_next_page()?.is_none())
+        match self.peek_next_page()? {
+            // Last page in the column chunk - always a record boundary
+            None => Ok(true),
+            // A V2 data page is required by the parquet spec to start at a
+            // record boundary, so the current page ends at one. V2 pages
+            // are identified by having `num_rows` set in their header.
+            Some(metadata) => Ok(metadata.num_rows.is_some()),
+        }
     }
 }
 
diff --git a/parquet/src/column/reader.rs b/parquet/src/column/reader.rs
index eb2d53fb7..29cb50185 100644
--- a/parquet/src/column/reader.rs
+++ b/parquet/src/column/reader.rs
@@ -309,20 +309,6 @@ where
             });
 
             if let Some(rows) = rows {
-                // If there is a pending partial record from a previous page,
-                // count it before considering the whole-page skip. When the
-                // next page provides num_rows (e.g. a V2 data page or via
-                // offset index), its records are self-contained, so the
-                // partial from the previous page is complete at this boundary.
-                if let Some(decoder) = self.rep_level_decoder.as_mut() {
-                    if decoder.flush_partial() {
-                        remaining_records -= 1;
-                        if remaining_records == 0 {
-                            return Ok(num_records);
-                        }
-                    }
-                }
-
                 if rows <= remaining_records {
                     self.page_reader.skip_next_page()?;
                     remaining_records -= rows;
diff --git a/parquet/src/file/serialized_reader.rs b/parquet/src/file/serialized_reader.rs
index b3b6383f7..254ccb779 100644
--- a/parquet/src/file/serialized_reader.rs
+++ b/parquet/src/file/serialized_reader.rs
@@ -1158,7 +1158,12 @@ impl<R: ChunkReader> PageReader for SerializedPageReader<R> {
 
     fn at_record_boundary(&mut self) -> Result<bool> {
         match &mut self.state {
-            SerializedPageReaderState::Values { .. } => Ok(self.peek_next_page()?.is_none()),
+            SerializedPageReaderState::Values { .. } => match self.peek_next_page()? {
+                None => Ok(true),
+                // V2 data pages must start at record boundaries per the parquet
+                // spec, so the current page ends at one.
+                Some(metadata) => Ok(metadata.num_rows.is_some()),
+            },
             SerializedPageReaderState::Pages { .. } => Ok(true),
         }
     }
```

I validated that this bugfix still lets the test in this PR and my end-to-end regression test in https://github.com/apache/arrow-rs/pull/9399 succeed (for the latter, I had to remove the `should_panic` attribute). If you think the new bugfix makes more sense, I'll commit it.

**Claude output on why this new bugfix is potentially faster:**

> **The problem**
>
> The original bugfix in PR #9374 added a 14-line `flush_partial` check to `skip_records()`, a generic function monomorphized 8 times (once per parquet physical type). This increased compiled code size across the entire compilation unit, causing instruction-cache effects that slowed down `read_records()` benchmarks by up to 54% for `FixedLenByteArray` types, even though read operations never call `skip_records()`.
> **The optimization: fix the root cause instead**
>
> The bug's root cause is that `at_record_boundary()` incorrectly returns `false` for V2 data pages in sequential reading mode. Per the parquet spec, V2 pages must start at record boundaries. By fixing `at_record_boundary()` to return `true` when the next page is V2, the existing `flush_partial` mechanism in both `read_records` and `skip_records` handles the bug correctly; no new code is needed in those hot functions.
>
> **Changes:**
>
> 1. `parquet/src/column/page.rs`: the default `at_record_boundary()` now returns `true` when the next page has `num_rows` (the V2 indicator), not only when there is no next page.
> 2. `parquet/src/file/serialized_reader.rs`: the same fix for the `Values` state (sequential reading without an offset index).
> 3. `parquet/src/column/reader.rs`: removed the 14-line `flush_partial` block from `skip_records()`. The function is now identical to master.
>
> **Why this eliminates the performance regression**
>
> - `skip_records()` returns to its original code size (actually slightly smaller, a net -4 lines)
> - No new code is added to any monomorphized generic function
> - The only change is in `at_record_boundary()`, which is called once per page load, not in the hot decoding loop
> - The existing `flush_partial` mechanism at page exhaustion handles the partial record naturally
>
> **Why the existing mechanism works**
>
> When `has_record_delimiter` is true (now correctly set for V2 pages), the existing code at lines 335-340 of `reader.rs` fires:
>
> ```rust
> if levels_read == remaining_levels && self.has_record_delimiter {
>     records_read += decoder.flush_partial() as usize;
> }
> ```
>
> This flushes the pending partial record when a page is exhausted during a skip, before the loop iterates back to the page-skip shortcut, preventing the over-count that caused the phantom-record bug.
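To see the rule the patch encodes in isolation, here is a minimal, self-contained sketch of the boundary decision. `PageMetadata` below is a simplified stand-in with only a `num_rows` field, and the free function mirrors the logic of the trait's `&mut self` method; it is an illustration, not the crate's actual API.

```rust
/// Simplified stand-in for the crate's page metadata: V2 data pages carry
/// `num_rows` in their header, V1 data pages do not.
struct PageMetadata {
    num_rows: Option<usize>,
}

/// Is the *current* page known to end at a record boundary?
///
/// * `None`       -> last page in the column chunk, always a boundary.
/// * `Some(meta)` -> a V2 next page (identified by `num_rows`) must start at
///                   a record boundary per the parquet spec, so the current
///                   page ends at one; a V1 next page gives no such guarantee.
fn at_record_boundary(next_page: Option<&PageMetadata>) -> bool {
    match next_page {
        None => true,
        Some(meta) => meta.num_rows.is_some(),
    }
}

fn main() {
    // Last page in the chunk: boundary.
    assert!(at_record_boundary(None));
    // Next page is V2 (num_rows present): boundary.
    assert!(at_record_boundary(Some(&PageMetadata { num_rows: Some(100) })));
    // Next page is V1 (no num_rows): cannot assume a boundary.
    assert!(!at_record_boundary(Some(&PageMetadata { num_rows: None })));
}
```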
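And a toy model of why that boundary matters for record counting, assuming the usual Dremel convention that a record starts wherever the repetition level is 0: a record still open at the end of a page can only be counted once the page is known to end at a boundary. This is the role `flush_partial` plays in the real reader, but the function below is only a sketch, not the crate's decoder.

```rust
/// Count complete records in one page's repetition levels, and report whether
/// the last record is still "partial" (it might continue on the next page).
fn count_records(rep_levels: &[i16], page_ends_at_boundary: bool) -> (usize, bool) {
    let mut complete = 0usize;
    let mut open = false; // a record has started but its end is not yet proven

    for &lvl in rep_levels {
        if lvl == 0 {
            // A new record starts here, so any open record is now complete.
            if open {
                complete += 1;
            }
            open = true;
        }
    }

    if open && page_ends_at_boundary {
        // Known boundary (last page, or the next page is V2): the open record
        // cannot spill into the next page, so count it now.
        complete += 1;
        open = false;
    }

    (complete, open)
}

fn main() {
    // Three record starts; the last record might continue past this page...
    let levels = [0, 1, 0, 1, 1, 0];

    // ...unless the page is known to end at a record boundary:
    assert_eq!(count_records(&levels, true), (3, false));

    // Without that guarantee nothing proves the last record ended here, so it
    // stays pending and must not be counted yet.
    assert_eq!(count_records(&levels, false), (2, true));
}
```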
