Russell Jurney commented on PIG-1150:

Oh - one other thing - I've read that this naive parallel method of calculating 
variance can have precision problems - the values get squared and summed, and 
then two large, nearly equal doubles get subtracted from one another.  I've 
thought of using BigDecimal, which can handle arbitrary-precision numbers.  My 
understanding is that BigDecimal arithmetic would be slower, but that the job 
would probably still be IO bound.  

Is that something people would like to see?  I could maybe make another UDF 
that uses BigDecimal or something.  I've never actually encountered the 
precision problems in practice, but I can see how that might be a big problem 
for some people.
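
To make the concern concrete, here is a minimal sketch (plain Java, not the 
code in var.patch) that computes population variance from the sum and the sum 
of squares, once with doubles and once with BigDecimal.  The class and method 
names are made up for illustration, and the sample values are chosen with a 
large mean so that the subtraction cancels most of the significant digits.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical illustration class; not part of Pig or of var.patch.
class VariancePrecisionSketch {

    // Naive formula with doubles: mean of squares minus square of mean.
    static double naiveVariance(double[] xs) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double x : xs) {
            sum += x;
            sumSq += x * x;   // each square is already rounded to a double
        }
        double mean = sum / xs.length;
        return sumSq / xs.length - mean * mean;
    }

    // Same formula, but the sums are accumulated exactly in BigDecimal.
    // Only the two divisions are rounded (to 34 significant digits).
    static BigDecimal bigDecimalVariance(double[] xs) {
        MathContext mc = MathContext.DECIMAL128;
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal sumSq = BigDecimal.ZERO;
        for (double x : xs) {
            BigDecimal v = BigDecimal.valueOf(x);
            sum = sum.add(v);
            sumSq = sumSq.add(v.multiply(v));
        }
        BigDecimal n = BigDecimal.valueOf(xs.length);
        BigDecimal mean = sum.divide(n, mc);
        return sumSq.divide(n, mc).subtract(mean.multiply(mean));
    }

    public static void main(String[] args) {
        // Small spread around a large mean; the true population variance is 22.5.
        double[] xs = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("double:     " + naiveVariance(xs));
        System.out.println("BigDecimal: " + bigDecimalVariance(xs));
    }
}
```

With these inputs the BigDecimal version recovers 22.5, while the double 
version cannot: near 1e18 adjacent doubles are 128 apart, so the two terms 
being subtracted have already lost the digits that matter.  Data with a mean 
close to zero would not show the problem, which may be why it rarely turns up 
in practice.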

> VAR() Variance UDF
> ------------------
>                 Key: PIG-1150
>                 URL: https://issues.apache.org/jira/browse/PIG-1150
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.5.0
>         Environment: UDF, written in Pig 0.5 contrib/
>            Reporter: Russell Jurney
>             Fix For: 0.7.0
>         Attachments: var.patch
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
> variance in a distributed manner, based on the AVG() builtin.  It works by 
> calculating the count, sum and sum of squares, as described here: 
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value 
> using the contrib SQRT() function gives Standard Deviation, which is missing 
> from Pig.
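
For readers unfamiliar with the approach in the description above, the 
following is a rough sketch of the count / sum / sum-of-squares idea in plain 
Java.  It does not use Pig's Algebraic interface and is not taken from 
var.patch; the class name and methods are hypothetical.  Each chunk of the 
input is reduced to a small partial result, partials are merged by adding their 
fields, and the variance is computed once at the end.

```java
// Hypothetical illustration class; not part of Pig or of var.patch.
final class VariancePartial {
    final long count;
    final double sum;
    final double sumSq;

    VariancePartial(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // "Initial" step: reduce one chunk of values to a partial result.
    static VariancePartial of(double[] chunk) {
        double s = 0.0;
        double sq = 0.0;
        for (double x : chunk) {
            s += x;
            sq += x * x;
        }
        return new VariancePartial(chunk.length, s, sq);
    }

    // "Intermediate" step: partials combine by field-wise addition,
    // so they can be merged in any order and grouping.
    VariancePartial merge(VariancePartial other) {
        return new VariancePartial(count + other.count,
                                   sum + other.sum,
                                   sumSq + other.sumSq);
    }

    // "Final" step: population variance, mean of squares minus square of mean.
    double variance() {
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }

    public static void main(String[] args) {
        VariancePartial a = VariancePartial.of(new double[] {2, 4, 4, 4});
        VariancePartial b = VariancePartial.of(new double[] {5, 5, 7, 9});
        double var = a.merge(b).variance();
        System.out.println("variance = " + var + ", stddev = " + Math.sqrt(var));
    }
}
```

Because merging is just addition, the partial results can be combined on the 
map side before the final value is computed on the reduce side, which is the 
property that makes an AVG()-style implementation possible.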

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
