Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

via GitHub Mon, 17 Jun 2024 04:38:35 -0700


marvinlanhenke commented on code in PR #10946:
URL: https://github.com/apache/datafusion/pull/10946#discussion_r1642676938



##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -1041,21 +1043,20 @@ impl<'a> StatisticsConverter<'a> {
     pub fn data_page_row_counts<I>(
         &self,
         column_offset_index: &ParquetOffsetIndex,
-        row_group_metadatas: &[RowGroupMetaData],
+        row_group_metadatas: &'a [RowGroupMetaData],
         row_group_indices: I,
-    ) -> Result<ArrayRef>
+    ) -> Result<UInt64Array>
     where
         I: IntoIterator<Item = &'a usize>,
     {
-        let data_type = self.arrow_field.data_type();
-
         let Some(parquet_index) = self.parquet_index else {
-            return Ok(self.make_null_array(data_type, row_group_indices));
+            // no matching column found in parquet_index;
+            // thus we cannot extract page_locations in order to determine
+            // the row count on a per DataPage basis.
+            // We use `row_group_row_counts` instead.
+            return Self::row_group_row_counts(row_group_metadatas);

Review Comment:
   ...perhaps looking at the behavior of `row_group_row_count` might help.
   
   My main question here is: 
   Why do we return row counts for a non existing column in 
`row_group_row_counts`?
   
   A column that does not exists cannot have valid rows, or a row count in that 
regard? So returning a null_array would indicate those missing information? 
Suppose we had just a single column, and for whatever reason, it does not exist 
in the parquet file (or does not match). With the current implementation we 
would still return a 'valid' row count in this scenario, accessing a somewhat 
invalid / useless parquet file? 
   
   I'm not sure one can follow my train of thought here, however I think we 
should return a null_array in both cases for `row_groups` and `data_pages` as 
well.
   
   But, I might be missing something. Perhaps you can explain the reasoning why 
we don't return a null_array in the current `row_group_row_counts` impl @alamb? 
That would be nice and might help guide our decision here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

Reply via email to