Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

via GitHub Mon, 17 Jun 2024 06:13:43 -0700


alamb commented on code in PR #10946:
URL: https://github.com/apache/datafusion/pull/10946#discussion_r1642795053



##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -1041,21 +1043,20 @@ impl<'a> StatisticsConverter<'a> {
     pub fn data_page_row_counts<I>(
         &self,
         column_offset_index: &ParquetOffsetIndex,
-        row_group_metadatas: &[RowGroupMetaData],
+        row_group_metadatas: &'a [RowGroupMetaData],
         row_group_indices: I,
-    ) -> Result<ArrayRef>
+    ) -> Result<UInt64Array>
     where
         I: IntoIterator<Item = &'a usize>,
     {
-        let data_type = self.arrow_field.data_type();
-
         let Some(parquet_index) = self.parquet_index else {
-            return Ok(self.make_null_array(data_type, row_group_indices));
+            // no matching column found in parquet_index;
+            // thus we cannot extract page_locations in order to determine
+            // the row count on a per DataPage basis.
+            // We use `row_group_row_counts` instead.
+            return Self::row_group_row_counts(row_group_metadatas);

Review Comment:
   > Why do we return row counts for a non existing column in 
row_group_row_counts?
   
   I think the (not great) reason is that it is the API needed for 
PruningStatistics here: 
https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L386-L390
   
   Also, since the `ParquetMetadata` knows how many row groups there are even 
when there are no row group statistics, it is possible
   
   For data pages, it is different as if the "page index" is not present then I 
don't think there is any way to know how many data pages there are other than 
reading the file 
   
   The more we explore this, the more you have convinced me that we shouldn't 
return row counts for non existent columns in `row_group_row_counts` either
   
   So I guess my new proposal would be to return `Option` like:
   
   ```rust
   impl<'a> StatisticsConverter<'a> {
   ...
     /// return OK(None) if the column does not exist
     pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> 
Result<Option<UInt64Array>> {
     ...
     }
   
     /// return OK(None) if the column does not exist
       pub fn data_page_row_counts<I>(
           &self,
           column_offset_index: &ParquetOffsetIndex,
           row_group_metadatas: &[RowGroupMetaData],
           row_group_indices: I,
       ) -> Result<Option<UInt64Array>>
       where
           I: IntoIterator<Item = &'a usize>,
       {
       ...
       }
   
   }
   ```
   
   The rationale to return an Option rather than an error is that creating and 
ignoring `DataFusionError` via `ok()` still requires a string allocation, which 
is wasteful
   
   I realize this is done many places already in the statistics extraction 
code, but I think for those cases it is meant to make the code resilent to 
incorrectly encoded parquet files rather than something that is "expected" to 
happen



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

Reply via email to