aviralgarg05 opened a new pull request, #19890:
URL: https://github.com/apache/datafusion/pull/19890

   ## Which issue does this PR close?
   
   - Closes #19839.
   
   ## Rationale for this change
   
   The `ParquetOpener` was using `ArrowReaderOptions::with_page_index(true)`, 
which internally sets `PageIndexPolicy::Required`. This caused sparse column 
chunk reads with row selection masks to fail with errors like "Invalid offset 
in sparse column chunk data" when reading Parquet files that lack page index 
metadata.
   
   Relaxing this policy to `PageIndexPolicy::Optional` allows DataFusion to 
gracefully handle files both with and without page index metadata while still 
leveraging the index when it exists.
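   
   To illustrate the intent behind the three policies, here is a minimal, self-contained sketch. It is a simplified model only: the `PageIndexPolicy` enum below is a local stand-in rather than the type from the `parquet` crate, `resolve_page_index` is a hypothetical helper, and the error string is invented for illustration (the failure reported in the issue surfaced later, during the sparse column chunk read).

```rust
/// Local stand-in for the three page index policies (hypothetical model,
/// not the type exported by the `parquet` crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    Skip,
    Optional,
    Required,
}

/// Hypothetical helper modelling the decision a reader makes.
/// Returns Ok(true) when the page index is used, Ok(false) when reading
/// proceeds without it, and Err when the policy cannot be satisfied.
fn resolve_page_index(policy: PageIndexPolicy, file_has_index: bool) -> Result<bool, String> {
    match (policy, file_has_index) {
        // Skip never consults the index, even if the file has one.
        (PageIndexPolicy::Skip, _) => Ok(false),
        // Optional and Required both use the index when it exists.
        (_, true) => Ok(true),
        // Optional tolerates a missing index and falls back to a plain read.
        (PageIndexPolicy::Optional, false) => Ok(false),
        // Required treats a missing index as an error (illustrative message).
        (PageIndexPolicy::Required, false) => {
            Err("page index required but missing".to_string())
        }
    }
}

fn main() {
    let policies = [
        PageIndexPolicy::Skip,
        PageIndexPolicy::Optional,
        PageIndexPolicy::Required,
    ];
    for policy in policies {
        for has_index in [true, false] {
            println!(
                "{:?}, file_has_index={}: {:?}",
                policy,
                has_index,
                resolve_page_index(policy, has_index)
            );
        }
    }
}
```

   In this model, only `Optional` both uses the index when present and still succeeds when it is absent, which is the behaviour this PR is after.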
   
   ## What changes are included in this PR?
   
   - Modified `datafusion/datasource-parquet/src/opener.rs` to use 
`PageIndexPolicy::Optional` instead of `Required`.
   - Added a new regression test in 
`datafusion/core/tests/parquet/issue_19839.rs` that validates reading a 
Parquet file written without a page index.
   
   ## Are these changes tested?
   
   Yes. I have added a dedicated regression test case:
   - `datafusion/core/tests/parquet/issue_19839.rs`
   
   This test writes a Parquet file specifically without page index metadata and 
verifies that `ParquetOpener` can read it successfully when 
`parquet_page_index_pruning` is enabled.
   
   ## Are there any user-facing changes?
   
   No. This is a bug fix that improves the robustness of the Parquet reader.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

