aviralgarg05 opened a new pull request, #19890: URL: https://github.com/apache/datafusion/pull/19890

## Which issue does this PR close?

- Closes #19839.

## Rationale for this change

The `ParquetOpener` was using `ArrowReaderOptions::with_page_index(true)`, which internally sets `PageIndexPolicy::Required`. This caused sparse column chunk reads with row selection masks to fail with errors like "Invalid offset in sparse column chunk data" when reading Parquet files that lack page index metadata.

Relaxing this policy to `PageIndexPolicy::Optional` allows DataFusion to gracefully handle files both with and without page index metadata while still leveraging the index when it exists.

## What changes are included in this PR?

- Modified `datafusion/datasource-parquet/src/opener.rs` to use `PageIndexPolicy::Optional` instead of `Required`.
- Added a new regression test in `datafusion/core/tests/parquet/issue_19839.rs` that validates reading a Parquet file written without a page index.

## Are these changes tested?

Yes. I have added a dedicated regression test case:

- `datafusion/core/tests/parquet/issue_19839.rs`

This test writes a Parquet file specifically without page index metadata and verifies that `ParquetOpener` can read it successfully when `parquet_page_index_pruning` is enabled.

## Are there any user-facing changes?

No. This is a bug fix that improves the robustness of the Parquet reader.
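
For readers unfamiliar with the policy distinction in the rationale, here is a minimal standalone sketch of how `Required` and `Optional` differ when a file has no page index. It is a simplified model, not the parquet crate's implementation: the `PageIndexPolicy` enum here is a local stand-in that mirrors the variant names, `PageIndex` and `resolve_page_index` are hypothetical, and the failure is modeled as an immediate error, whereas the failure reported in this PR surfaced later as "Invalid offset in sparse column chunk data".

```rust
/// Local stand-in mirroring the policy names discussed above.
/// This is NOT the parquet crate's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    /// Never load the page index, even if the file has one.
    Skip,
    /// Load the page index when present; carry on without it otherwise.
    Optional,
    /// Load the page index and treat its absence as a failure.
    Required,
}

/// Hypothetical placeholder for page index metadata found in a file.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PageIndex {
    num_pages: usize,
}

/// Hypothetical helper: what the reader ends up with for a given policy,
/// depending on whether the file carries a page index.
fn resolve_page_index(
    policy: PageIndexPolicy,
    index_in_file: Option<PageIndex>,
) -> Result<Option<PageIndex>, String> {
    match (policy, index_in_file) {
        // Skip ignores the index regardless of what the file contains.
        (PageIndexPolicy::Skip, _) => Ok(None),
        // Optional passes through whatever the file has, including nothing.
        (PageIndexPolicy::Optional, found) => Ok(found),
        // Required succeeds only when the index is actually there.
        (PageIndexPolicy::Required, Some(index)) => Ok(Some(index)),
        (PageIndexPolicy::Required, None) => {
            Err("page index required but missing from file".to_string())
        }
    }
}

fn main() {
    let policies = [
        PageIndexPolicy::Skip,
        PageIndexPolicy::Optional,
        PageIndexPolicy::Required,
    ];
    for policy in policies {
        // One file written with a page index, one written without.
        let with_index = resolve_page_index(policy, Some(PageIndex { num_pages: 3 }));
        let without_index = resolve_page_index(policy, None);
        let pages = with_index
            .as_ref()
            .ok()
            .and_then(|index| index.as_ref())
            .map(|index| index.num_pages);
        println!(
            "{:?}: pages seen with index = {:?}, file without index = {:?}",
            policy, pages, without_index
        );
    }
}
```

In this model only `Optional` both uses the index when it exists and tolerates files written without one, which is the behavior the PR is after.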
