alamb opened a new issue, #10965: URL: https://github.com/apache/datafusion/issues/10965
### Is your feature request related to a problem or challenge? While working on https://github.com/apache/datafusion/issues/10926 @marvinlanhenke has [an excellent question](https://github.com/apache/datafusion/pull/10946#discussion_r1642676938): > Why do we return row counts for a non existing column in row_group_row_counts? (this was because there was some inconsistency between data pages and row counts) The reason `StatisticsConverter::row_group_counts` returns row counts even for a non existent column is because it is the API needed for PruningStatistics here: https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L386-L390 It is possible to do because the `ParquetMetadata` knows how many row groups there are even when there are no row group statistics, but it doesn't make logical sense. Furthermore, for data pages, it is different if the "page index" is not present as then it doesn't even make sense to ask how many rows are in each data page as we don't have any data pages Thus I think `row_group_row_counts` should also default to returning `None` if the column is not present, as @marvinlanhenke has done for `null_counts_page_statistics` in So I guess my new proposal would be to return `Option` like: ```rust impl<'a> StatisticsConverter<'a> { ... /// return OK(None) if the column does not exist pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> { ... } /// return OK(None) if the column does not exist pub fn data_page_row_counts<I>( &self, column_offset_index: &ParquetOffsetIndex, row_group_metadatas: &[RowGroupMetaData], row_group_indices: I, ) -> Result<Option<UInt64Array>> where I: IntoIterator<Item = &'a usize>, { ... } } ``` The rationale to return an Option rather than an error is that creating and ignoring `DataFusionError` via `ok()` still requires a string allocation, which is wasteful I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
