crepererum commented on a change in pull request #307:
URL: https://github.com/apache/arrow-rs/pull/307#discussion_r638598264



##########
File path: parquet/src/column/writer.rs
##########
@@ -607,9 +607,11 @@ impl<T: DataType> ColumnWriterImpl<T> {
         let max_def_level = self.descr.max_def_level();
         let max_rep_level = self.descr.max_rep_level();
 
+        // always update column NULL count, no matter if page stats are used
+        self.num_column_nulls += self.num_page_nulls;

Review comment:
       I've just checked:
   
   ```rust
   #[test]
   fn statistics_null_counts_big_mixed() {
       // check that null-count statistics work for larger data sizes
       let len = 1_000_000u64;
       let data: Vec<_> = (0..len).map(|x| if x % 2 == 0 {Some(x)} else 
{None}).collect();
       let values = Arc::new(UInt64Array::from(data));
       let file = one_column_roundtrip("null_counts_big_mixed", values, true);
   
       // check statistics are valid
       let reader = SerializedFileReader::new(file).unwrap();
       let metadata = reader.metadata();
       assert_eq!(metadata.num_row_groups(), 1);
       let row_group = metadata.row_group(0);
       assert_eq!(row_group.num_columns(), 1);
       let column = row_group.column(0);
       let stats = column.statistics().unwrap();
       assert_eq!(stats.null_count(), len / 2);
   }
   ```
   
   This triggers the `add_data_page` via `write_mini_batch` code path (with a 
small patch to the roundtrip test code to set a higher batch size). That works 
(with this PR) and both values are set. TBH that's what I expected since I 
don't really understand your reasoning behind "In this case it seems 
num_page_nulls and subsequently num_column_nulls will not be set.".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to