alamb commented on code in PR #10946: URL: https://github.com/apache/datafusion/pull/10946#discussion_r1642795053
########## datafusion/core/src/datasource/physical_plan/parquet/statistics.rs: ########## @@ -1041,21 +1043,20 @@ impl<'a> StatisticsConverter<'a> { pub fn data_page_row_counts<I>( &self, column_offset_index: &ParquetOffsetIndex, - row_group_metadatas: &[RowGroupMetaData], + row_group_metadatas: &'a [RowGroupMetaData], row_group_indices: I, - ) -> Result<ArrayRef> + ) -> Result<UInt64Array> where I: IntoIterator<Item = &'a usize>, { - let data_type = self.arrow_field.data_type(); - let Some(parquet_index) = self.parquet_index else { - return Ok(self.make_null_array(data_type, row_group_indices)); + // no matching column found in parquet_index; + // thus we cannot extract page_locations in order to determine + // the row count on a per DataPage basis. + // We use `row_group_row_counts` instead. + return Self::row_group_row_counts(row_group_metadatas); Review Comment: > Why do we return row counts for a non existing column in row_group_row_counts? I think the (not great) reason is that it is the API needed for PruningStatistics here: https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L386-L390 Also, since the `ParquetMetadata` knows how many row groups there are even when there are no row group statistics, it is possible For data pages, it is different as if the "page index" is not present then I don't think there is any way to know how many data pages there are other than reading the file The more we explore this, the more you have convinced me that we shouldn't return row counts for non existent columns in `row_group_row_counts` either So I guess my new proposal would be to return `Option` like: ```rust impl<'a> StatisticsConverter<'a> { ... /// return OK(None) if the column does not exist pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> { ... } /// return OK(None) if the column does not exist pub fn data_page_row_counts<I>( &self, column_offset_index: &ParquetOffsetIndex, row_group_metadatas: &[RowGroupMetaData], row_group_indices: I, ) -> Result<Option<UInt64Array>> where I: IntoIterator<Item = &'a usize>, { ... } } ``` The rationale to return an Option rather than an error is that creating and ignoring `DataFusionError` via `ok()` still requires a string allocation, which is wasteful I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org