alamb opened a new pull request, #10813:
URL: https://github.com/apache/datafusion/pull/10813

   
   
   ## Which issue does this PR close?
   
   Closes https://github.com/apache/datafusion/issues/9929
   
   
   ## Rationale for this change
   
   Many query engines and use cases maintain some sort of specialized index 
for data stored in parquet. This index can be used to determine which row 
groups / selections within a file are needed.
   
   However, the DataFusion `ParquetExec` has no way for users to pass this 
information in. Instead it tries to prune row groups based on the min/max 
statistics and other information in the file's metadata.
   
   This PR makes it possible for users to pass in a `ParquetAccessPlan` to 
`ParquetExec` with a starting plan, which is then further pruned based on the 
file's metadata.
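
   To illustrate the intended semantics (a user-supplied starting plan that metadata pruning can only narrow, never widen), here is a minimal, self-contained sketch. It is a simplified model and not the DataFusion API: the `AccessPlan` and `RowGroupAccess` types, their methods, the `(min, max)` statistics tuples, and the length check in `validate` are all hypothetical stand-ins chosen for this example.

```rust
/// Hypothetical model of a per-file access plan: one decision per row group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowGroupAccess {
    Skip,
    Scan,
}

/// Hypothetical stand-in for an access plan supplied by the user.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AccessPlan {
    row_groups: Vec<RowGroupAccess>,
}

impl AccessPlan {
    /// Start with a plan that scans every row group.
    pub fn new_all(row_group_count: usize) -> Self {
        Self {
            row_groups: vec![RowGroupAccess::Scan; row_group_count],
        }
    }

    /// Mark one row group as not needed (panics if `idx` is out of range).
    pub fn skip(&mut self, idx: usize) {
        self.row_groups[idx] = RowGroupAccess::Skip;
    }

    /// Assumed error check: the plan must describe exactly as many row
    /// groups as the file actually contains.
    pub fn validate(&self, actual_row_groups: usize) -> Result<(), String> {
        if self.row_groups.len() != actual_row_groups {
            return Err(format!(
                "plan has {} row groups, file has {}",
                self.row_groups.len(),
                actual_row_groups
            ));
        }
        Ok(())
    }

    /// Further prune using per-row-group (min, max) statistics for a
    /// predicate `lo <= value <= hi`. Only ever turns Scan into Skip.
    pub fn prune_by_stats(&mut self, stats: &[(i64, i64)], lo: i64, hi: i64) {
        for (access, &(min, max)) in self.row_groups.iter_mut().zip(stats) {
            if max < lo || min > hi {
                *access = RowGroupAccess::Skip;
            }
        }
    }

    /// Indexes of the row groups that still need to be read.
    pub fn row_groups_to_scan(&self) -> Vec<usize> {
        self.row_groups
            .iter()
            .enumerate()
            .filter(|(_, a)| **a == RowGroupAccess::Scan)
            .map(|(i, _)| i)
            .collect()
    }
}

/// Usage: an external index rules out row group 1, then statistics prune more.
pub fn example() -> Result<Vec<usize>, String> {
    let mut plan = AccessPlan::new_all(4);
    plan.skip(1); // decision from the (hypothetical) external index

    // Made-up (min, max) statistics for the four row groups.
    let stats = [(0, 9), (10, 19), (20, 29), (30, 39)];
    plan.validate(stats.len())?;

    // Predicate: 15 <= value <= 25. Row group 1 overlaps this range, but it
    // stays skipped because the user's plan already excluded it.
    plan.prune_by_stats(&stats, 15, 25);
    Ok(plan.row_groups_to_scan())
}
```

   In this model `example()` leaves only row group 2 to scan: the statistics alone would also keep row group 1, but the starting plan wins because pruning never re-enables a skipped row group.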
   
   
   ## What changes are included in this PR?
   1. Allow users to pass in a `ParquetAccessPlan` for each `PartitionedFile` 
read by `ParquetExec`
   2. Add error checking to `ParquetAccessPlan` now that it can be specified by 
users
   3. Document how this works
   4. Add tests for this new API
   
   
   ## Are these changes tested?
   Yes, new tests are added.
   
   ## Are there any user-facing changes?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

