tustvold commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805476068
##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
});
if let Some(rows) = rows {
+ // If there is a pending partial record from a previous
page,
+ // count it before considering the whole-page skip. When
the
+ // next page provides num_rows (e.g. a V2 data page or via
+ // offset index), its records are self-contained, so the
+ // partial from the previous page is complete at this
boundary.
+ if let Some(decoder) = self.rep_level_decoder.as_mut() {
+ if decoder.flush_partial() {
Review Comment:
I haven't been following the conversation very closely, but IIRC records
shouldn't be split across any pages, regardless of V1 or V2. The issue is that
many older writers, including arrow-rs, did occasionally do this, as the spec
at the time was ambiguous. IIRC this was independent of V1 vs V2.
IMO enough time has probably passed that we can just assume that records
aren't split across pages, and error on encountering a page with a non-zero
initial repetition level. Potentially with some flag to disable this, that also
disables things like filter pushdown
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]