erratic-pattern opened a new issue, #9239: URL: https://github.com/apache/arrow-rs/issues/9239
## Describe the bug

Error:

```
Parquet error: Invalid offset in sparse column chunk data: 754, no matching page found. If you are using a `SelectionStrategyPolicy::Mask`, ensure that the OffsetIndex is provided when creating the InMemoryRowGroup.
```

Occurs when:

1. A predicate uses `RowSelectionStrategy::Selectors` with a `RowSelector` list that skips an entire page.
2. Another predicate uses `RowSelectionStrategy::Mask` by triggering the [mask run-length threshold].
3. The column with `RowSelectionStrategy::Mask` is **not** in the output projection, so [`should_force_selectors`] does not force it to use `RowSelectionStrategy::Selectors`.
4. The mask strategy attempts to fetch pages that were skipped, resulting in an error.

## To Reproduce

A minimal reproducer is available at: https://github.com/erratic-pattern/parquet_mask_strategy_missing_pages

```bash
git clone https://github.com/erratic-pattern/parquet_mask_strategy_missing_pages
cd parquet_mask_strategy_missing_pages
cargo test
```

The test uses a parquet file with:

- 2 row groups, 300 rows each
- Tag column with values 'a', 'b', 'c' sorted (100 rows each)
- Time column with alternating in-range/out-of-range values
- Page size set so tag='b' section contains at least one full page

The test simulates a query like `SELECT tag WHERE tag IN ('a', 'c') AND time >= X AND time < Y` with three predicates:

1. `tag IN ('a', 'c')` - creates initial selection `[select 100, skip 100, select 100]`
2. `time >= X` - creates sparse selection, pages fetched as Sparse
3. `time < Y` - triggers Mask strategy due to sparse selection from predicate 2

## Additional context

- Introduced in parquet 57.1.0 via https://github.com/apache/arrow-rs/pull/8733
- Related to https://github.com/apache/arrow-rs/issues/8845

[mask run-length threshold]: https://github.com/apache/arrow-rs/blob/57.1.0/parquet/src/arrow/arrow_reader/selection.rs#L47
[`should_force_selectors`]: https://github.com/apache/arrow-rs/blob/57.1.0/parquet/src/arrow/arrow_reader/selection.rs#L275
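
For readers unfamiliar with the failure mode, the interaction in "Occurs when" steps 1 to 4 can be sketched with a toy model. Everything below is hypothetical and for illustration only: `Selector`, `Page`, `fetched_pages`, `page_for_row`, and `example_pages` are made-up names, not the parquet crate's types or functions, and the 50-row page size is an assumption chosen so that the tag='b' section (rows 100 to 199) spans whole pages.

```rust
// Toy model only: these types are hypothetical and are NOT the parquet crate's API.

/// One run of a row selection, loosely analogous to a `RowSelector`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Selector {
    Select(usize),
    Skip(usize),
}

/// Half-open row range `[start, end)` covered by a single data page.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Page {
    start: usize,
    end: usize,
}

/// Selector-style fetch: keep only pages that overlap at least one selected row.
/// Pages that lie entirely inside a skipped run are never fetched.
fn fetched_pages(pages: &[Page], selection: &[Selector]) -> Vec<Page> {
    // Turn the run-length selection into absolute selected row ranges.
    let mut selected: Vec<(usize, usize)> = Vec::new();
    let mut row = 0;
    for s in selection {
        match *s {
            Selector::Select(n) => {
                selected.push((row, row + n));
                row += n;
            }
            Selector::Skip(n) => row += n,
        }
    }
    pages
        .iter()
        .copied()
        .filter(|p| selected.iter().any(|&(s, e)| p.start < e && s < p.end))
        .collect()
}

/// Mask-style lookup: find the fetched page that contains `row`.
/// Fails when `row` falls inside a page that was skipped earlier.
fn page_for_row(fetched: &[Page], row: usize) -> Result<Page, String> {
    fetched
        .iter()
        .copied()
        .find(|p| p.start <= row && row < p.end)
        .ok_or_else(|| format!("no matching page found for row {}", row))
}

/// Six 50-row pages covering one 300-row row group (page size is an assumption).
fn example_pages() -> Vec<Page> {
    (0..6)
        .map(|i| Page { start: i * 50, end: (i + 1) * 50 })
        .collect()
}
```

With the selection `[Select(100), Skip(100), Select(100)]`, `fetched_pages(&example_pages(), ...)` keeps only the four pages covering rows 0 to 99 and 200 to 299. A later `page_for_row` call for any row in 100 to 199 returns an `Err`, which mirrors, in simplified form, the "no matching page found" error reported above.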
