alamb commented on code in PR #5183:
URL: https://github.com/apache/arrow-rs/pull/5183#discussion_r1420758889
##########
parquet/src/column/writer/mod.rs:
##########
@@ -329,28 +329,11 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a,
E> {
None => values.len(),
};
- // If only computing chunk-level statistics compute them here,
page-level statistics
- // are computed in [`Self::write_mini_batch`] and used to update chunk
statistics in
- // [`Self::add_data_page`]
- if self.statistics_enabled == EnabledStatistics::Chunk
Review Comment:
Are there any performance implications of always computing the statistics
even when they aren't written to the page?
Though it seems like if we are writing column chunk level statistics anyways
we would have to calculate the min/max across all values anyways, so tracking
per page shouldn't be any more expensive
##########
parquet/src/column/writer/mod.rs:
##########
@@ -329,28 +329,11 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a,
E> {
None => values.len(),
};
- // If only computing chunk-level statistics compute them here,
page-level statistics
- // are computed in [`Self::write_mini_batch`] and used to update chunk
statistics in
- // [`Self::add_data_page`]
- if self.statistics_enabled == EnabledStatistics::Chunk
- // INTERVAL has undefined sort order, so don't write min/max stats
for it
- && self.descr.converted_type() != ConvertedType::INTERVAL
- {
- match (min, max) {
- (Some(min), Some(max)) => {
- update_min(&self.descr, min, &mut
self.column_metrics.min_column_value);
- update_max(&self.descr, max, &mut
self.column_metrics.max_column_value);
- }
- (None, Some(_)) | (Some(_), None) => {
- panic!("min/max should be both set or both None")
- }
- (None, None) => {
- if let Some((min, max)) = self.encoder.min_max(values,
value_indices) {
- update_min(&self.descr, &min, &mut
self.column_metrics.min_column_value);
- update_max(&self.descr, &max, &mut
self.column_metrics.max_column_value);
- }
- }
- };
+ if let Some(min) = min {
Review Comment:
This min is only passed in when precomputed statistics are used
(`write_batch_with_statistics`). Is that correct? I couldn't find another
callsite where `min` and `max` were non zero
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]