sunchao commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448014374
##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -387,15 +538,28 @@ impl<T: DataType> ColumnWriterImpl<T> {
));
}
- // TODO: update page statistics
+ if calculate_page_stats {
+ for val in &values[0..values_to_write] {
+ if self.min_page_value.is_none()
Review comment:
maybe we should extract this into a util function `update_page_min_max`
and similarly for column min/max value.
##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -276,12 +372,60 @@ impl<T: DataType> ColumnWriterImpl<T> {
&values[values_offset..],
def_levels.map(|lv| &lv[levels_offset..]),
rep_levels.map(|lv| &lv[levels_offset..]),
+ calculate_page_stats,
)?;
// Return total number of values processed.
Ok(values_offset)
}
+ /// Writes batch of values, definition levels and repetition levels.
+ /// Returns number of values processed (written).
+ ///
+ /// If definition and repetition levels are provided, we write fully those
levels and
+ /// select how many values to write (this number will be returned), since
number of
+ /// actual written values may be smaller than provided values.
+ ///
+ /// If only values are provided, then all values are written and the
length of
+ /// of the values buffer is returned.
+ ///
+ /// Definition and/or repetition levels can be omitted, if values are
+ /// non-nullable and/or non-repeated.
+ pub fn write_batch(
+ &mut self,
+ values: &[T::T],
+ def_levels: Option<&[i16]>,
+ rep_levels: Option<&[i16]>,
+ ) -> Result<usize> {
+ self.write_batch_internal(
+ values, def_levels, rep_levels, &None, &None, None, None,
+ )
+ }
+
+ /// Writer may optionally provide pre-calculated statistics for this
batch, in which case we do
+ /// not calculate page level statistics as this will defeat the purpose of
speeding up the write
+ /// process with pre-calculated statistics.
+ pub fn write_batch_with_statistics(
+ &mut self,
+ values: &[T::T],
+ def_levels: Option<&[i16]>,
+ rep_levels: Option<&[i16]>,
+ min: &Option<T::T>,
Review comment:
Is it possible that among these 4 parameters, ppl only provide, say,
`nulls_count` but leave the rest as `None`? will this result to partial stats
and yield to issues when compute engines want to rely on them? If so do we want
to enforce that either all of these 4 are `None` OR all of these are `Some`?
##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
def_levels_sink: vec![],
rep_levels_sink: vec![],
data_pages: VecDeque::new(),
+ min_page_value: None,
+ max_page_value: None,
+ num_page_nulls: 0,
+ page_distinct_count: None,
Review comment:
Seems this is not used at all?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]