sdf-jkl commented on code in PR #9118:
URL: https://github.com/apache/arrow-rs/pull/9118#discussion_r2725814652
##########
parquet/src/arrow/arrow_reader/read_plan.rs:
##########
@@ -175,6 +180,21 @@ impl ReadPlanBuilder {
Ok(self)
}
+ /// Add offset index metadata for each column in a row group to this `ReadPlanBuilder`
+ pub fn with_offset_index_metadata(
Review Comment:
I came up with a counterexample where taking the offsets from the column with
the finest page boundaries doesn't work.
```
┏━━━━┓ ┌────────┐ ┌────────┐
- '1' means selected ┃ 0 ┃ │ Row 0 │ │ Row 0 │
- '0' means filtered ┃ 0 ┃ │ Row 1 │ A Page 0 │ Row 1 │
┃ 0 ┃ │ Row 2 │ (skipped) │ Row 2 │
┃ ┃ └────────┘ │ Row 3 │ B Page 0
┃ 1 ┃ ┌────────┐ └────────┘
┃ 1 ┃ │ Row 3 │ A Page 1 ┌────────┐
┃ 1 ┃ │ Row 4 │ (fetched) │ Row 4 │
┃ 0 ┃ │ Row 5 │ │ Row 5 │
┃ ┃ └────────┘ │ Row 6 │ B Page 1 (skipped)
┃ 0 ┃ ┌────────┐ │ Row 7 │
┃ 0 ┃ │ Row 6 │ A Page 2 └────────┘
┃ 0 ┃ │ Row 7 │ ┌────────┐
┃ 0 ┃ │ Row 8 │ │ Row 8 │ B Page 2 (skipped)
┃ ┃ └────────┘ │ Row 9 │
┗━━━━┛ └────────┘
Mask chunking uses A's finest boundary:
- At mask_start = row 3, next A page boundary = row 6
- Chunk reads rows 3–5
But Column B has 4-row pages:
- rows 0–3 in B Page 0 (fetched)
- rows 4–7 in B Page 1 (skipped)
→ rows 4–5 are in a skipped B page → invalid offset
```
We could go back to building a vec of all page offsets and looking up the
closest page end for each mask chunk.
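
A rough sketch of that lookup, to make the idea concrete. It is not code from this PR: `merged_page_boundaries` and `next_page_end` are hypothetical helper names, and the sketch assumes each column's page boundaries are already available as the first row index of every page (for example, derived from the offset index).

```rust
/// Hypothetical helper: merge the first-row index of every page of every
/// column into one sorted, deduplicated list of boundaries. `num_rows` is
/// appended so the last page of each column also has an end.
fn merged_page_boundaries(page_starts_per_column: &[Vec<usize>], num_rows: usize) -> Vec<usize> {
    let mut boundaries: Vec<usize> = page_starts_per_column
        .iter()
        .flatten()
        .copied()
        .filter(|&row| row < num_rows)
        .collect();
    boundaries.push(num_rows);
    boundaries.sort_unstable();
    boundaries.dedup();
    boundaries
}

/// Hypothetical helper: the closest page end strictly after `mask_start`,
/// i.e. an exclusive upper bound for a mask chunk that crosses no page
/// boundary in any column. Returns `None` if `mask_start` is at or past
/// the last boundary.
fn next_page_end(boundaries: &[usize], mask_start: usize) -> Option<usize> {
    // Binary search for the first boundary greater than `mask_start`.
    let idx = boundaries.partition_point(|&b| b <= mask_start);
    boundaries.get(idx).copied()
}
```

Page starts loosely based on the diagram are `[0, 3, 6, 9]` for A (assuming row 9 starts a fourth A page) and `[0, 4, 8]` for B, with 10 rows. The merged boundaries are then `[0, 3, 4, 6, 8, 9, 10]`, and the chunk starting at row 3 ends at row 4 instead of row 6, so it no longer runs into B Page 1.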