[I] Change `StatisticsConverter::row_group_counts` to return `None` for non existent columns in parquet files [datafusion]

via GitHub Mon, 17 Jun 2024 12:21:57 -0700


alamb opened a new issue, #10965:
URL: https://github.com/apache/datafusion/issues/10965


   ### Is your feature request related to a problem or challenge?
   
   While working on https://github.com/apache/datafusion/issues/10926 
@marvinlanhenke  has [an excellent 
question](https://github.com/apache/datafusion/pull/10946#discussion_r1642676938):
   
   > Why do we return row counts for a non existing column in 
row_group_row_counts?
   
   (this was because there was some inconsistency between data pages and row 
counts)
   
   The reason `StatisticsConverter::row_group_counts` returns row counts even 
for a non existent column is because it is the API needed for PruningStatistics 
here: 
https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L386-L390
   
   It is possible to do because the `ParquetMetadata` knows how many row groups 
there are even when there are no row group statistics, but it doesn't make 
logical sense. Furthermore, for data pages, it is different if the "page index" 
is not present as then it doesn't even make sense to ask how many rows are in 
each data page as we don't have any data pages
   
   Thus I think `row_group_row_counts` should also default to returning `None` 
if the column is not present, as @marvinlanhenke  has done for 
`null_counts_page_statistics` in 
   
   So I guess my new proposal would be to return `Option` like:
   
   ```rust
   impl<'a> StatisticsConverter<'a> {
   ...
     /// return OK(None) if the column does not exist
     pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> 
Result<Option<UInt64Array>> {
     ...
     }
   
     /// return OK(None) if the column does not exist
       pub fn data_page_row_counts<I>(
           &self,
           column_offset_index: &ParquetOffsetIndex,
           row_group_metadatas: &[RowGroupMetaData],
           row_group_indices: I,
       ) -> Result<Option<UInt64Array>>
       where
           I: IntoIterator<Item = &'a usize>,
       {
       ...
       }
   
   }
   ```
   
   The rationale to return an Option rather than an error is that creating and 
ignoring `DataFusionError` via `ok()` still requires a string allocation, which 
is wasteful
   
   I realize this is done many places already in the statistics extraction 
code, but I think for those cases it is meant to make the code resilent to 
incorrectly encoded parquet files rather than something that is "expected" to 
happen
   
   
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Change `StatisticsConverter::row_group_counts` to return `None` for non existent columns in parquet files [datafusion]

Reply via email to