jonded94 commented on PR #9374: URL: https://github.com/apache/arrow-rs/pull/9374#issuecomment-3891211393
Hey, unfortunately this is a bit outside my skill set/expertise, but here is a slightly different fix: [different_bugfix.patch](https://github.com/user-attachments/files/25264302/different_bugfix.patch)

```diff
diff --git a/parquet/src/column/page.rs b/parquet/src/column/page.rs
index f18b296c1..4cfc07a02 100644
--- a/parquet/src/column/page.rs
+++ b/parquet/src/column/page.rs
@@ -406,7 +406,14 @@ pub trait PageReader: Iterator<Item = Result<Page>> + Send {
     /// [(#4327)]: https://github.com/apache/arrow-rs/pull/4327
     /// [(#4943)]: https://github.com/apache/arrow-rs/pull/4943
     fn at_record_boundary(&mut self) -> Result<bool> {
-        Ok(self.peek_next_page()?.is_none())
+        match self.peek_next_page()? {
+            // Last page in the column chunk - always a record boundary
+            None => Ok(true),
+            // A V2 data page is required by the parquet spec to start at a
+            // record boundary, so the current page ends at one. V2 pages
+            // are identified by having `num_rows` set in their header.
+            Some(metadata) => Ok(metadata.num_rows.is_some()),
+        }
     }
 }
 
diff --git a/parquet/src/column/reader.rs b/parquet/src/column/reader.rs
index eb2d53fb7..29cb50185 100644
--- a/parquet/src/column/reader.rs
+++ b/parquet/src/column/reader.rs
@@ -309,20 +309,6 @@ where
             });
 
             if let Some(rows) = rows {
-                // If there is a pending partial record from a previous page,
-                // count it before considering the whole-page skip. When the
-                // next page provides num_rows (e.g. a V2 data page or via
-                // offset index), its records are self-contained, so the
-                // partial from the previous page is complete at this boundary.
-                if let Some(decoder) = self.rep_level_decoder.as_mut() {
-                    if decoder.flush_partial() {
-                        remaining_records -= 1;
-                        if remaining_records == 0 {
-                            return Ok(num_records);
-                        }
-                    }
-                }
-
                 if rows <= remaining_records {
                     self.page_reader.skip_next_page()?;
                     remaining_records -= rows;
diff --git a/parquet/src/file/serialized_reader.rs b/parquet/src/file/serialized_reader.rs
index b3b6383f7..254ccb779 100644
--- a/parquet/src/file/serialized_reader.rs
+++ b/parquet/src/file/serialized_reader.rs
@@ -1158,7 +1158,12 @@ impl<R: ChunkReader> PageReader for SerializedPageReader<R> {
 
     fn at_record_boundary(&mut self) -> Result<bool> {
         match &mut self.state {
-            SerializedPageReaderState::Values { .. } => Ok(self.peek_next_page()?.is_none()),
+            SerializedPageReaderState::Values { .. } => match self.peek_next_page()? {
+                None => Ok(true),
+                // V2 data pages must start at record boundaries per the parquet
+                // spec, so the current page ends at one.
+                Some(metadata) => Ok(metadata.num_rows.is_some()),
+            },
             SerializedPageReaderState::Pages { .. } => Ok(true),
         }
     }
```

I validated that this bugfix still lets the test in this PR and my end-to-end regression test in https://github.com/apache/arrow-rs/pull/9399 succeed (for the latter, I had to remove the `should_panic` attribute). If you think the new bugfix makes more sense, I'll commit it.

**Claude output on why this new bugfix is potentially faster:**

> **The problem**
>
> The original bugfix in PR #9374 added a 14-line `flush_partial` check to `skip_records()`, a generic function monomorphized 8 times (once per parquet physical type). This increased compiled code size across the entire compilation unit, causing instruction-cache effects that slowed down `read_records()` benchmarks by up to 54% for `FixedLenByteArray` types, even though read operations never call `skip_records()`.
> **The optimization: fix the root cause instead**
>
> The bug's root cause is that `at_record_boundary()` incorrectly returns `false` for V2 data pages in sequential reading mode. Per the parquet spec, V2 pages must start at record boundaries. By fixing `at_record_boundary()` to return `true` when the next page is V2, the existing `flush_partial` mechanism in both `read_records` and `skip_records` handles the bug correctly; no new code is needed in those hot functions.
>
> **Changes:**
>
> 1. `parquet/src/column/page.rs`: the default `at_record_boundary()` now returns `true` when the next page has `num_rows` (the V2 indicator), not only when there is no next page.
> 2. `parquet/src/file/serialized_reader.rs`: the same fix for the `Values` state (sequential reading without an offset index).
> 3. `parquet/src/column/reader.rs`: removed the 14-line `flush_partial` block from `skip_records()`. The function is now identical to master.
>
> **Why this eliminates the performance regression**
>
> - `skip_records()` returns to its original code size (actually slightly smaller, a net -4 lines)
> - No new code is added to any monomorphized generic function
> - The only change is in `at_record_boundary()`, which is called once per page load, not in the hot decoding loop
> - The existing `flush_partial` mechanism at page exhaustion handles the partial record naturally
>
> **Why the existing mechanism works**
>
> When `has_record_delimiter` is true (now correctly set for V2 pages), the existing code at lines 335-340 of `reader.rs` fires:
>
> ```rust
> if levels_read == remaining_levels && self.has_record_delimiter {
>     records_read += decoder.flush_partial() as usize;
> }
> ```
>
> This flushes the pending partial record when a page is exhausted during a skip, before the loop iterates back to the page-skip shortcut, preventing the over-count that caused the phantom-record bug.
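To see the rule the patch encodes in isolation, here is a minimal, self-contained sketch of the boundary decision. `PageMetadata` below is a simplified stand-in with only a `num_rows` field, and the free function mirrors the logic of the trait's `&mut self` method; it is an illustration, not the crate's actual API.

```rust
/// Simplified stand-in for the crate's page metadata: V2 data pages carry
/// `num_rows` in their header, V1 data pages do not.
struct PageMetadata {
    num_rows: Option<usize>,
}

/// Is the *current* page known to end at a record boundary?
///
/// * `None`       -> last page in the column chunk, always a boundary.
/// * `Some(meta)` -> a V2 next page (identified by `num_rows`) must start at
///                   a record boundary per the parquet spec, so the current
///                   page ends at one; a V1 next page gives no such guarantee.
fn at_record_boundary(next_page: Option<&PageMetadata>) -> bool {
    match next_page {
        None => true,
        Some(meta) => meta.num_rows.is_some(),
    }
}

fn main() {
    // Last page in the chunk: boundary.
    assert!(at_record_boundary(None));
    // Next page is V2 (num_rows present): boundary.
    assert!(at_record_boundary(Some(&PageMetadata { num_rows: Some(100) })));
    // Next page is V1 (no num_rows): cannot assume a boundary.
    assert!(!at_record_boundary(Some(&PageMetadata { num_rows: None })));
}
```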
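And a toy model of why that boundary matters for record counting, assuming the usual Dremel convention that a record starts wherever the repetition level is 0: a record still open at the end of a page can only be counted once the page is known to end at a boundary. This is the role `flush_partial` plays in the real reader, but the function below is only a sketch, not the crate's decoder.

```rust
/// Count complete records in one page's repetition levels, and report whether
/// the last record is still "partial" (it might continue on the next page).
fn count_records(rep_levels: &[i16], page_ends_at_boundary: bool) -> (usize, bool) {
    let mut complete = 0usize;
    let mut open = false; // a record has started but its end is not yet proven

    for &lvl in rep_levels {
        if lvl == 0 {
            // A new record starts here, so any open record is now complete.
            if open {
                complete += 1;
            }
            open = true;
        }
    }

    if open && page_ends_at_boundary {
        // Known boundary (last page, or the next page is V2): the open record
        // cannot spill into the next page, so count it now.
        complete += 1;
        open = false;
    }

    (complete, open)
}

fn main() {
    // Three record starts; the last record might continue past this page...
    let levels = [0, 1, 0, 1, 1, 0];

    // ...unless the page is known to end at a record boundary:
    assert_eq!(count_records(&levels, true), (3, false));

    // Without that guarantee nothing proves the last record ended here, so it
    // stays pending and must not be counted yet.
    assert_eq!(count_records(&levels, false), (2, true));
}
```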
