marvinlanhenke commented on code in PR #10946: URL: https://github.com/apache/datafusion/pull/10946#discussion_r1642676938
########## datafusion/core/src/datasource/physical_plan/parquet/statistics.rs: ########## @@ -1041,21 +1043,20 @@ impl<'a> StatisticsConverter<'a> { pub fn data_page_row_counts<I>( &self, column_offset_index: &ParquetOffsetIndex, - row_group_metadatas: &[RowGroupMetaData], + row_group_metadatas: &'a [RowGroupMetaData], row_group_indices: I, - ) -> Result<ArrayRef> + ) -> Result<UInt64Array> where I: IntoIterator<Item = &'a usize>, { - let data_type = self.arrow_field.data_type(); - let Some(parquet_index) = self.parquet_index else { - return Ok(self.make_null_array(data_type, row_group_indices)); + // no matching column found in parquet_index; + // thus we cannot extract page_locations in order to determine + // the row count on a per DataPage basis. + // We use `row_group_row_counts` instead. + return Self::row_group_row_counts(row_group_metadatas); Review Comment: ...perhaps looking at the behavior of `row_group_row_count` might help. My main question here is: Why do we return row counts for a non existing column in `row_group_row_counts`? A column that does not exists cannot have valid rows, or a row count in that regard? So returning a null_array would indicate those missing information? Suppose we had just a single column, and for whatever reason, it does not exist in the parquet file (or does not match). With the current implementation we would still return a 'valid' row count in this scenario, accessing a somewhat invalid / useless parquet file? I'm not sure one can follow my train of thought here, however I think we should return a null_array in both cases for `row_groups` and `data_pages` as well. But, I might be missing something. Perhaps you can explain the reasoning why we don't return a null_array in the current `row_group_row_counts` impl @alamb? That would be nice and might help guide our decision here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org