alamb commented on code in PR #10946:
URL: https://github.com/apache/datafusion/pull/10946#discussion_r1642609253

##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -1041,21 +1043,20 @@ impl<'a> StatisticsConverter<'a> {
     pub fn data_page_row_counts<I>(
         &self,
         column_offset_index: &ParquetOffsetIndex,
-        row_group_metadatas: &[RowGroupMetaData],
+        row_group_metadatas: &'a [RowGroupMetaData],
         row_group_indices: I,
-    ) -> Result<ArrayRef>
+    ) -> Result<UInt64Array>
     where
         I: IntoIterator<Item = &'a usize>,
     {
-        let data_type = self.arrow_field.data_type();
-
         let Some(parquet_index) = self.parquet_index else {
-            return Ok(self.make_null_array(data_type, row_group_indices));
+            // no matching column found in parquet_index;
+            // thus we cannot extract page_locations in order to determine
+            // the row count on a per DataPage basis.
+            // We use `row_group_row_counts` instead.
+            return Self::row_group_row_counts(row_group_metadatas);

Review Comment:
   I see -- this is a tricky situation where there is no column and thus no information on data pages.

   Another potential behavior that might make sense here would be to return an error, because unlike other functions in `StatisticsConverter` there is no way to "gracefully" ignore missing information.

   Or we could possibly return an array with zero rows 🤔


##########
datafusion/core/tests/parquet/arrow_statistics.rs:
##########
@@ -1990,3 +2003,38 @@ async fn test_column_not_found() {
     }
     .run_col_not_found();
 }
+
+#[tokio::test]
+async fn test_column_non_existent() {

Review Comment:
   💯 for the new tests - thank you @marvinlanhenke


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

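
For reference, the three behaviors discussed in the first review comment above (fall back to row-group row counts as the diff does, return an error, or return an array with zero rows) can be compared side by side in a minimal, std-only sketch. Everything here is an assumption made for illustration: `MissingColumnPolicy` and this free function `data_page_row_counts` are hypothetical stand-ins, `Vec<u64>` replaces `UInt64Array`, `String` replaces the DataFusion error type, and the `Option` argument models `self.parquet_index` being absent. It is not the code in the PR.

```rust
/// Hypothetical policy for a column that is absent from the Parquet file.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MissingColumnPolicy {
    /// Fall back to one row count per row group (what the diff above does).
    RowGroupCounts,
    /// Report the missing column as an error.
    Error,
    /// Return an array with zero rows.
    Empty,
}

/// Simplified stand-in for per-data-page row counts.
///
/// `page_rows` holds, for each row group, the number of rows in each data
/// page; it is `None` when the column has no match in the file, so no page
/// locations are available.
pub fn data_page_row_counts(
    page_rows: Option<&[Vec<u64>]>,
    row_group_rows: &[u64],
    policy: MissingColumnPolicy,
) -> Result<Vec<u64>, String> {
    let Some(page_rows) = page_rows else {
        // No matching column: there is nothing to count per page, so the
        // caller-selected policy decides what to hand back.
        return match policy {
            MissingColumnPolicy::RowGroupCounts => Ok(row_group_rows.to_vec()),
            MissingColumnPolicy::Error => {
                Err("column not found; no data page information".to_string())
            }
            MissingColumnPolicy::Empty => Ok(Vec::new()),
        };
    };

    // Column present: flatten the per-row-group page counts into one list.
    Ok(page_rows.iter().flatten().copied().collect())
}
```

Note that with `RowGroupCounts` the result has one entry per row group rather than one per data page, so a caller cannot tell from the length alone which granularity it received; the `Error` and `Empty` variants make the missing information explicit instead.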
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org