freeformstu opened a new issue, #4314:
URL: https://github.com/apache/arrow-datafusion/issues/4314

   **Describe the bug**
   A clear and concise description of what the bug is.
   
   The variance aggregation calculation returns an error if there is only one 
item. But it should return `0`.
   
   If given the sequence `[1, 1, 1, 1]`, the variance will be zero. If given 
the sequence `[1]` the variance should also be zero.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   Run the tests, there is a test which asserts that given a single item, 
variance will return an error.
   
   **Expected behavior**
   A clear and concise description of what you expected to happen.
   
   If given a sequence of length 1, the variance should calculate `0`.
   
   **Additional context**
   Add any other context about the problem here.
   
   The variance accumulator cites its reference implementation here:
   
https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/aggregate/variance.rs#L173-L179
   > /// The algrithm used is an online implementation and numerically stable. 
It is based on this paper:
   > /// Welford, B. P. (1962). "Note on a method for calculating corrected 
sums of squares and products".
   > /// Technometrics. 4 (3): 419–420. doi:10.2307/1266577. JSTOR 1266577.
   The summary of which can be found here 
https://www.tandfonline.com/doi/abs/10.1080/00401706.1962.10490022.
   
   As far as I can tell, this is just a bug not a feature of that algorithm.
   
   Additionally, there is another bug in the current logic:
   
https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/aggregate/variance.rs#L297-L307
   ```rust
           if count <= 1 {
               return Err(DataFusionError::Internal(
                   "At least two values are needed to calculate 
variance".to_string(),
               ));
           }
   
           if self.count == 0 {
               Ok(ScalarValue::Float64(None))
           } else {
               Ok(ScalarValue::Float64(Some(self.m2 / count as f64)))
           }
   ```
   If `count` is `0`, then we will always return an error even though that 
second `if` block suggests intent to return `None` in the case of zero.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to