alamb commented on code in PR #5377: URL: https://github.com/apache/arrow-datafusion/pull/5377#discussion_r1119385689
########## datafusion/physical-expr/src/aggregate/count_distinct.rs: ########## @@ -149,6 +149,43 @@ impl DistinctCountAccumulator { self.update(&row_values) }) } + + // calculating the size for fixed length values, taking first batch size * number of batches + // This method is faster than .full_size(), however it is not suitable for variable length values like strings or complex types + fn fixed_size(&self) -> usize { + std::mem::size_of_val(self) + + (std::mem::size_of::<DistinctScalarValues>() * self.values.capacity()) + + self + .values + .iter() + .next() Review Comment: 👍 that is nice ########## datafusion/physical-expr/src/aggregate/count_distinct.rs: ########## @@ -216,23 +253,27 @@ impl Accumulator for DistinctCountAccumulator { } fn size(&self) -> usize { - std::mem::size_of_val(self) - + (std::mem::size_of::<DistinctScalarValues>() * self.values.capacity()) - + self - .values - .iter() - .map(|vals| { - ScalarValue::size_of_vec(&vals.0) - std::mem::size_of_val(&vals.0) - }) - .sum::<usize>() - + (std::mem::size_of::<DataType>() * self.state_data_types.capacity()) - + self - .state_data_types - .iter() - .map(|dt| dt.size() - std::mem::size_of_val(dt)) - .sum::<usize>() - + self.count_data_type.size() - - std::mem::size_of_val(&self.count_data_type) + match &self.count_data_type { + DataType::Boolean + | DataType::Date32 + | DataType::Date64 + | DataType::Float16 + | DataType::Float32 + | DataType::Float64 + | DataType::Int16 + | DataType::Int32 + | DataType::Int64 + | DataType::Int8 + | DataType::Time32(_) + | DataType::Time64(_) + | DataType::Null + | DataType::Timestamp(_, _) Review Comment: I think this is missing some types like `Interval` and `Duration` Perhaps you could check instead if `DataType::primitive_width` returned `Some(.)` 🤔 https://docs.rs/arrow/34.0.0/arrow/datatypes/enum.DataType.html#method.primitive_width -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org