alamb opened a new issue, #9693:
URL: https://github.com/apache/arrow-rs/issues/9693

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   `apache/datafusion` ran into this while working on page pruning in 
[apache/datafusion#21556](https://github.com/apache/datafusion/pull/21556).
   
   Today, 
[`ParquetMetaData::column_index`](https://github.com/apache/arrow-rs/blob/master/parquet/src/file/metadata/mod.rs)
 and 
[`ParquetMetaData::offset_index`](https://github.com/apache/arrow-rs/blob/master/parquet/src/file/metadata/mod.rs)
 return `None` both when the file has no page index and when the page index has 
not been fetched yet. That behavior is tied to how 
[`ParquetMetaDataReader::load_page_index`](https://github.com/apache/arrow-rs/blob/master/parquet/src/file/metadata/reader.rs)
 works today.
   
   That makes it hard for downstream consumers to optimize their page-pruning 
flow. In DataFusion, for example, we want to:
   - avoid loading page-index metadata unless there is a usable page-pruning 
predicate
   - avoid building page-pruning predicates when the file has no page index
   
   The first part is possible today. The second is not, because when indexes 
are not already loaded, `None` is ambiguous.
   
   Relevant DataFusion code:
   - 
[`has_page_index`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L941-L945)
   - 
[`build_page_pruning_predicate`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L948-L954)
   - 
[`FiltersPreparedParquetOpen::load_page_index`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L957-L984)
   - helper that eventually calls Arrow’s metadata loader: 
[`load_page_index`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L1717-L1743)
   
   **Describe the solution you'd like**
   
   An API that exposes page-index availability separately from whether the 
actual index payload has been loaded.
   
   Examples:
   - `page_index_state() -> Unknown | Absent | PresentNotLoaded | PresentLoaded`
   - or a smaller API such as `has_page_index() -> Option<bool>` with 
documented semantics
   
   The important part is allowing callers to distinguish:
   - page index absent
   - page index not yet loaded
   
   without requiring an actual page-index load.
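
   As a rough sketch of how the first option could look to a caller (every 
name below is hypothetical and not part of the current `parquet` crate API; 
`PruningAction` and `plan_page_pruning` exist only for illustration), the 
four states could be modeled as an enum that also collapses to the smaller 
`Option<bool>` shape:

```rust
/// Hypothetical state type mirroring the `page_index_state()` idea above.
/// This is NOT an existing arrow-rs type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexState {
    /// The metadata cannot tell whether the file has a page index.
    Unknown,
    /// The file is known to have no page index.
    Absent,
    /// The file has a page index, but its payload has not been fetched.
    PresentNotLoaded,
    /// The file has a page index and its payload is in memory.
    PresentLoaded,
}

impl PageIndexState {
    /// Collapse to the smaller `has_page_index() -> Option<bool>` shape:
    /// `None` means "unknown", never "absent".
    pub fn has_page_index(self) -> Option<bool> {
        match self {
            PageIndexState::Unknown => None,
            PageIndexState::Absent => Some(false),
            PageIndexState::PresentNotLoaded | PageIndexState::PresentLoaded => Some(true),
        }
    }
}

/// Hypothetical downstream decision, named here only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PruningAction {
    /// Do not build a page-pruning predicate and do not load the index.
    Skip,
    /// Fetch the page index first, then prune.
    LoadThenPrune,
    /// The index is already in memory; prune directly.
    Prune,
}

/// Decide what to do for one file without performing any I/O.
pub fn plan_page_pruning(state: PageIndexState, has_usable_predicate: bool) -> PruningAction {
    if !has_usable_predicate {
        // No usable predicate: never pay for loading the index.
        return PruningAction::Skip;
    }
    match state {
        PageIndexState::Absent => PruningAction::Skip,
        PageIndexState::PresentLoaded => PruningAction::Prune,
        // With `Unknown` the caller still has to attempt a load, which is
        // the conservative behavior available today.
        PageIndexState::PresentNotLoaded | PageIndexState::Unknown => PruningAction::LoadThenPrune,
    }
}
```

   With something along these lines, a caller could skip predicate 
construction as soon as the state is `Absent`, and only the `Unknown` case 
would fall back to today's load-and-see behavior.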
   
   **Describe alternatives you've considered**
   
   Downstream consumers can attempt an optional page-index load and infer 
absence from the result, but that forces extra I/O and complicates the pruning 
path.
   
   Consumers can also treat `None` as “unknown” and conservatively proceed, but 
that still means they may do unnecessary predicate construction for files that 
do not have page indexes.
   
   **Additional context**
   
   This would help engines like DataFusion reduce unnecessary work in 
low-latency scan paths and simplify page-pruning control flow.

