Russell Jurney commented on PIG-1150:
Oh - one other thing - I've read that this naive parallel method of calculating
variance can have precision problems: the mean of the squares and the square of
the mean are both large doubles, and subtracting one from the other can cancel
most of the significant digits. I've thought of using BigDecimal, which can
handle arbitrary-precision numbers. My understanding is that this would be
slower, but that the job would probably still be I/O bound.
Is that something people would like to see? I could make another UDF that uses
BigDecimal. I've never actually encountered the precision problems in practice,
but I can see how they might be a big problem for some people.
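To make the precision concern concrete, here is a rough sketch (not the code in var.patch) comparing the naive double computation with a BigDecimal one. The class name PrecisionSketch, the method names, the use of population variance, and the choice of MathContext.DECIMAL128 are all assumptions made for illustration.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical illustration only; none of these names come from the patch.
final class PrecisionSketch {

    // Naive population variance in doubles: E[x^2] - (E[x])^2.
    static double naiveVariance(double[] xs) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double x : xs) {
            sum += x;
            sumSq += x * x;
        }
        double mean = sum / xs.length;
        return sumSq / xs.length - mean * mean;
    }

    // Same formula, but the sums are accumulated exactly in BigDecimal.
    // Only the two divisions are rounded (to 34 significant digits).
    static BigDecimal bigDecimalVariance(double[] xs) {
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal sumSq = BigDecimal.ZERO;
        for (double x : xs) {
            BigDecimal v = new BigDecimal(x); // exact value of the double
            sum = sum.add(v);
            sumSq = sumSq.add(v.multiply(v));
        }
        BigDecimal n = BigDecimal.valueOf(xs.length);
        BigDecimal mean = sum.divide(n, MathContext.DECIMAL128);
        BigDecimal meanOfSquares = sumSq.divide(n, MathContext.DECIMAL128);
        return meanOfSquares.subtract(mean.multiply(mean));
    }

    public static void main(String[] args) {
        // Large values with a small spread; the true population variance is 22.5.
        double[] xs = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("naive double: " + naiveVariance(xs));
        System.out.println("BigDecimal:   " + bigDecimalVariance(xs));
    }
}
```

With inputs around 1e9 the squares are around 1e18, where adjacent doubles are 128 apart, so the naive result cannot come out as 22.5. Inputs with a small mean relative to their spread would not show the problem, which may be why it rarely turns up in practice.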
> VAR() Variance UDF
> Key: PIG-1150
> URL: https://issues.apache.org/jira/browse/PIG-1150
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.5.0
> Environment: UDF, written in Pig 0.5 contrib/
> Reporter: Russell Jurney
> Fix For: 0.7.0
> Attachments: var.patch
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates
> variance in a distributed manner, based on the AVG() builtin. It works by
> calculating the count, sum and sum of squares, as described here:
> Is this a worthwhile contribution? Taking the square root of this value
> using the contrib SQRT() function gives Standard Deviation, which is missing
> from Pig.
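The count, sum and sum-of-squares approach described above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the attached patch: it leaves out Pig's Algebraic interface and tuple handling, the names VarianceSketch, Partial, initial, merge and variance are hypothetical, and it assumes population variance (dividing by n).

```java
// Hypothetical illustration only; a real Pig UDF would wrap these steps in
// the Initial, Intermediate and Final classes required by Algebraic.
final class VarianceSketch {

    // Partial aggregate for one chunk of the input.
    static final class Partial {
        final long count;
        final double sum;
        final double sumSq;

        Partial(long count, double sum, double sumSq) {
            this.count = count;
            this.sum = sum;
            this.sumSq = sumSq;
        }
    }

    // Map side: reduce a chunk of values to (count, sum, sum of squares).
    static Partial initial(double[] values) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double v : values) {
            sum += v;
            sumSq += v * v;
        }
        return new Partial(values.length, sum, sumSq);
    }

    // Combiner: partials merge by component-wise addition, in any order.
    static Partial merge(Partial a, Partial b) {
        return new Partial(a.count + b.count, a.sum + b.sum, a.sumSq + b.sumSq);
    }

    // Reduce side: population variance = E[x^2] - (E[x])^2.
    static double variance(Partial p) {
        if (p.count == 0) {
            return Double.NaN; // no data, variance undefined
        }
        double mean = p.sum / p.count;
        return p.sumSq / p.count - mean * mean;
    }

    public static void main(String[] args) {
        Partial a = initial(new double[] {1.0, 2.0, 3.0});
        Partial b = initial(new double[] {4.0, 5.0});
        System.out.println(variance(merge(a, b))); // prints 2.0
    }
}
```

Because merging is just addition of the three components, the partials can be combined in any grouping, which is what lets the calculation run in a distributed manner. Math.sqrt of the result would give the standard deviation.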