Re: [PR] Fix skip_records over-counting when partial record precedes num_rows page skip [arrow-rs]

via GitHub Fri, 13 Feb 2026 10:14:28 -0800


tustvold commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805476068



##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
                 });
 
                 if let Some(rows) = rows {
+                    // If there is a pending partial record from a previous page,
+                    // count it before considering the whole-page skip. When the
+                    // next page provides num_rows (e.g. a V2 data page or via
+                    // offset index), its records are self-contained, so the
+                    // partial from the previous page is complete at this boundary.
+                    if let Some(decoder) = self.rep_level_decoder.as_mut() {
+                        if decoder.flush_partial() {

Review Comment:
   I haven't been following the conversation very closely, but IIRC records
   shouldn't be split across pages at all, regardless of V1 or V2. The issue is
   that many older writers, including arrow-rs, occasionally did this because
   the spec at the time was ambiguous.

   IMO enough time has probably passed that we can just assume that records
   aren't split across pages, and return an error on encountering a page with
   a non-zero initial repetition level. Potentially with some flag to disable
   this check, which would also disable things like filter pushdown.
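
   To make the suggestion concrete, here is a minimal sketch of such a check.
   It relies only on the Parquet rule that a repetition level of 0 marks the
   start of a new record, so a page whose first repetition level is non-zero
   continues a record from the previous page. The names `PageCheckError`,
   `ReadOptions`, `allow_split_records` and `check_page_start` are hypothetical
   and are not part of the parquet crate or of this PR.

```rust
/// Hypothetical error type for this sketch; not part of the parquet crate.
#[derive(Debug, PartialEq)]
enum PageCheckError {
    /// The page began in the middle of a record.
    RecordSplitAcrossPages { first_rep_level: i16 },
}

/// Hypothetical stand-in for the "flag to disable this" from the comment.
/// A real reader would presumably also turn off features such as filter
/// pushdown when this is set; that part is not modeled here.
#[derive(Clone, Copy)]
struct ReadOptions {
    allow_split_records: bool,
}

/// Checks the decoded repetition levels of a single page.
///
/// A repetition level of 0 starts a new record, so a non-zero first level
/// means the page continues a record begun on an earlier page.
fn check_page_start(rep_levels: &[i16], opts: ReadOptions) -> Result<(), PageCheckError> {
    match rep_levels.first() {
        Some(&level) if level != 0 && !opts.allow_split_records => {
            Err(PageCheckError::RecordSplitAcrossPages { first_rep_level: level })
        }
        // An empty page, a page starting at level 0, or the opt-out flag
        // being set all pass the check.
        _ => Ok(()),
    }
}
```

   With `allow_split_records` set to `false`, a page with levels `[0, 1, 1]`
   passes and a page with levels `[1, 0, 1]` is rejected. Setting the flag to
   `true` accepts both, which is the legacy-file escape hatch.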



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.