alamb commented on code in PR #10946:
URL: https://github.com/apache/datafusion/pull/10946#discussion_r1642609253


##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -1041,21 +1043,20 @@ impl<'a> StatisticsConverter<'a> {
     pub fn data_page_row_counts<I>(
         &self,
         column_offset_index: &ParquetOffsetIndex,
-        row_group_metadatas: &[RowGroupMetaData],
+        row_group_metadatas: &'a [RowGroupMetaData],
         row_group_indices: I,
-    ) -> Result<ArrayRef>
+    ) -> Result<UInt64Array>
     where
         I: IntoIterator<Item = &'a usize>,
     {
-        let data_type = self.arrow_field.data_type();
-
         let Some(parquet_index) = self.parquet_index else {
-            return Ok(self.make_null_array(data_type, row_group_indices));
+            // no matching column found in parquet_index;
+            // thus we cannot extract page_locations in order to determine
+            // the row count on a per DataPage basis.
+            // We use `row_group_row_counts` instead.
+            return Self::row_group_row_counts(row_group_metadatas);

Review Comment:
   I see -- this is a tricky situation where there is no column and thus no 
information on data pages. 
   
   Another behavior that might make sense here would be to return an 
error, because unlike other functions in `StatisticsConverter` there is no way 
to "gracefully" ignore the missing information.
   
   Or we could possibly return an array with zero rows 🤔 
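
To make the options concrete, here is a rough sketch of the three behaviors side by side. This is not the actual DataFusion code: `MissingColumnPolicy` and `page_row_counts_or` are hypothetical names made up for illustration, and a plain `Vec<u64>` stands in for `UInt64Array` so the example needs only the standard library.

```rust
/// Hypothetical policy for the case where the column is not found in the
/// parquet file (illustration only, not part of `StatisticsConverter`).
#[derive(Debug, Clone, Copy)]
pub enum MissingColumnPolicy {
    /// What the diff above does: fall back to one count per row group.
    RowGroupCounts,
    /// Alternative 1: report an error to the caller.
    Error,
    /// Alternative 2: return an array with zero rows.
    Empty,
}

/// `page_counts` is `None` when no matching column exists, so there are no
/// page locations to derive per-page row counts from.
/// `Vec<u64>` is a stand-in for `UInt64Array` (assumption for this sketch).
pub fn page_row_counts_or(
    page_counts: Option<&[u64]>,
    row_group_counts: &[u64],
    policy: MissingColumnPolicy,
) -> Result<Vec<u64>, String> {
    let Some(counts) = page_counts else {
        return match policy {
            MissingColumnPolicy::RowGroupCounts => Ok(row_group_counts.to_vec()),
            MissingColumnPolicy::Error => Err(
                "column not found; cannot compute per-page row counts".to_string(),
            ),
            MissingColumnPolicy::Empty => Ok(Vec::new()),
        };
    };
    // Column found: return the per-page counts unchanged.
    Ok(counts.to_vec())
}
```

The trade-off this highlights: the fallback returns values at a different granularity (row groups rather than pages), while the error and empty-array variants make the missing column visible to the caller.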



##########
datafusion/core/tests/parquet/arrow_statistics.rs:
##########
@@ -1990,3 +2003,38 @@ async fn test_column_not_found() {
     }
     .run_col_not_found();
 }
+
+#[tokio::test]
+async fn test_column_non_existent() {

Review Comment:
   💯  for the new tests - thank you @marvinlanhenke 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

