alamb commented on a change in pull request #1547:
URL: https://github.com/apache/arrow-datafusion/pull/1547#discussion_r783085256



##########
File path: datafusion/src/physical_plan/expressions/variance.rs
##########
@@ -230,93 +235,186 @@ impl VarianceAccumulator {
         self.count
     }
 
-    pub fn get_mean(&self) -> ScalarValue {
-        self.mean.clone()
+    pub fn get_mean(&self) -> f64 {
+        self.mean
     }
 
-    pub fn get_m2(&self) -> ScalarValue {
-        self.m2.clone()
+    pub fn get_m2(&self) -> f64 {
+        self.m2
     }
 }
 
 impl Accumulator for VarianceAccumulator {
     fn state(&self) -> Result<Vec<ScalarValue>> {
         Ok(vec![
             ScalarValue::from(self.count),
-            self.mean.clone(),
-            self.m2.clone(),
+            ScalarValue::from(self.mean),
+            ScalarValue::from(self.m2),
         ])
     }
 
+    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
+        let values = &cast(&values[0], &DataType::Float64)?;
+        let arr = values.as_any().downcast_ref::<Float64Array>().unwrap();
+
+        for i in 0..arr.len() {
+            let value = arr.value(i);
+
+            if value == 0_f64 && values.is_null(i) {
+                continue;
+            }
+            let new_count = self.count + 1;

Review comment:
       > After some investigation, this approach does work as expected. 
   
   FWIW I tried making these changes locally and all the tests passed for me
   
   > The reason for the null check is because downcast_ref replace the None 
values into 0_f64 so we need to check in the original array when a 0 is 
observed. 
   
   I am not sure this is accurate. The way arrow works is that the values and 
"validity" are tracked in separate structures.
   
   Thus for elements that are `NULL` there is some arbitrary value in the array 
(which will likely be `0.0f`,  though that is not guaranteed by the arrow spec).
   
   The construct of `arr.iter()` returns an iterator of `Option<f64>` that is 
`None` if the element is NULL, and `Some(f64_value)` if the element is non-NULL.
   
   the use of `filter_map` then filters out the `None` elements, somewhat non 
obviously
   
   This
   ```rust
           for value in arr.iter().filter_map(|v| v) {
   ```
   
   Is effectively the same as
   ```rust
           for value in arr.iter() {
             let value = match value {
               Some(v) => v,
               None => continue,
           };
   ```
   
   So I actually think there is a bug in this code as written with nulls -- the 
check should be
   
   ```rust
               if values.is_null(i) {
   ```
   
   Rather than
   
   ```rust
               if value == 0_f64 && values.is_null(i) {
   ```
   
   (as null values are not guaranteed to be 0.0f)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to