sunchao commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448014374



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -387,15 +538,28 @@ impl<T: DataType> ColumnWriterImpl<T> {
             ));
         }
 
-        // TODO: update page statistics
+        if calculate_page_stats {
+            for val in &values[0..values_to_write] {
+                if self.min_page_value.is_none()

Review comment:
       maybe we should extract this into a util function `update_page_min_max` 
and similarly for column min/max value.

##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -276,12 +372,60 @@ impl<T: DataType> ColumnWriterImpl<T> {
             &values[values_offset..],
             def_levels.map(|lv| &lv[levels_offset..]),
             rep_levels.map(|lv| &lv[levels_offset..]),
+            calculate_page_stats,
         )?;
 
         // Return total number of values processed.
         Ok(values_offset)
     }
 
+    /// Writes batch of values, definition levels and repetition levels.
+    /// Returns number of values processed (written).
+    ///
+    /// If definition and repetition levels are provided, we write fully those 
levels and
+    /// select how many values to write (this number will be returned), since 
number of
+    /// actual written values may be smaller than provided values.
+    ///
+    /// If only values are provided, then all values are written and the 
length of
+    /// of the values buffer is returned.
+    ///
+    /// Definition and/or repetition levels can be omitted, if values are
+    /// non-nullable and/or non-repeated.
+    pub fn write_batch(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+    ) -> Result<usize> {
+        self.write_batch_internal(
+            values, def_levels, rep_levels, &None, &None, None, None,
+        )
+    }
+
+    /// Writer may optionally provide pre-calculated statistics for this 
batch, in which case we do
+    /// not calculate page level statistics as this will defeat the purpose of 
speeding up the write
+    /// process with pre-calculated statistics.
+    pub fn write_batch_with_statistics(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+        min: &Option<T::T>,

Review comment:
       Is it possible that among these 4 parameters, ppl only provide, say, 
`nulls_count` but leave the rest as `None`? will this result to partial stats 
and yield to issues when compute engines want to rely on them? If so do we want 
to enforce that either all of these 4 are `None` OR all of these are `Some`?

##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       Seems this is not used at all?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to