alamb commented on a change in pull request #1547: URL: https://github.com/apache/arrow-datafusion/pull/1547#discussion_r783085256
########## File path: datafusion/src/physical_plan/expressions/variance.rs ########## @@ -230,93 +235,186 @@ impl VarianceAccumulator { self.count } - pub fn get_mean(&self) -> ScalarValue { - self.mean.clone() + pub fn get_mean(&self) -> f64 { + self.mean } - pub fn get_m2(&self) -> ScalarValue { - self.m2.clone() + pub fn get_m2(&self) -> f64 { + self.m2 } } impl Accumulator for VarianceAccumulator { fn state(&self) -> Result<Vec<ScalarValue>> { Ok(vec![ ScalarValue::from(self.count), - self.mean.clone(), - self.m2.clone(), + ScalarValue::from(self.mean), + ScalarValue::from(self.m2), ]) } + fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { + let values = &cast(&values[0], &DataType::Float64)?; + let arr = values.as_any().downcast_ref::<Float64Array>().unwrap(); + + for i in 0..arr.len() { + let value = arr.value(i); + + if value == 0_f64 && values.is_null(i) { + continue; + } + let new_count = self.count + 1; Review comment: > After some investigation, this approach does work as expected. FWIW I tried making these changes locally and all the tests passed for me > The reason for the null check is because downcast_ref replace the None values into 0_f64 so we need to check in the original array when a 0 is observed. I am not sure this is accurate. The way arrow works is that the values and "validity" are tracked in separate structures. Thus for elements that are `NULL` there is some arbitrary value in the array (which will likely be `0.0f`, though that is not guaranteed by the arrow spec). The construct of `arr.iter()` returns an iterator of `Option<f64>` that is `None` if the element is NULL, and `Some(f64_value)` if the element is non-NULL. the use of `filter_map` then filters out the `None` elements, somewhat non obviously This ```rust for value in arr.iter().filter_map(|v| v) { ``` Is effectively the same as ```rust for value in arr.iter() { let value = match value { Some(v) => v, None => continue, }; ``` So I actually think there is a bug in this code as written with nulls -- the check should be ```rust if values.is_null(i) { ``` Rather than ```rust if value == 0_f64 && values.is_null(i) { ``` (as null values are not guaranteed to be 0.0f) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org