[ https://issues.apache.org/jira/browse/HIVE-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735890#action_12735890 ]
Emil Ibrishimov commented on HIVE-607:
--------------------------------------
Hey Scott. The formula you are using has precision problems when the variance
is very small relative to the sum of squares: devavg and avg*avg can both get
really big while the variance stays really small, so a lot of information can
be lost, and sometimes the result can even be negative.
I am using a modification of this formula:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
which fixes this problem.
I will attach a patch tomorrow when I'm done testing it.
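
To illustrate the precision problem described above, here is a minimal Python
sketch. It compares a "mean of squares minus square of mean" computation with
the textbook one-pass update from the linked Wikipedia section. This is not
the modified formula from the forthcoming patch, and the function names and
sample data are made up for the example.

```python
def naive_variance(values):
    """Population variance as mean(x^2) - mean(x)^2; prone to cancellation."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    return mean_of_squares - mean * mean


def online_variance(values):
    """Population variance via the one-pass running update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)  # uses the already-updated mean
    return m2 / count if count else float("nan")


# Large values with a small spread: the true population variance is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("naive :", naive_variance(data))   # far from 22.5 in double precision
print("online:", online_variance(data))  # 22.5
```

The naive version subtracts two numbers near 1e18, where adjacent doubles are
more than 100 apart, so the small true variance cannot survive the
subtraction. The running update only ever works with deviations from the
current mean, which stay small.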
> Create statistical UDFs.
> ------------------------
>
> Key: HIVE-607
> URL: https://issues.apache.org/jira/browse/HIVE-607
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Query Processor
> Reporter: S. Alex Smith
> Assignee: Emil Ibrishimov
> Priority: Minor
> Attachments: UDAFStddev.java
>
>
> Create UDFs replicating:
> STD() Return the population standard deviation
> STDDEV_POP()(v5.0.3) Return the population standard deviation
> STDDEV_SAMP()(v5.0.3) Return the sample standard deviation
> STDDEV() Return the population standard deviation
> SUM() Return the sum
> VAR_POP()(v5.0.3) Return the population standard variance
> VAR_SAMP()(v5.0.3) Return the sample variance
> VARIANCE()(v4.1) Return the population standard variance
> as found at http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html.
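
For reference, the population and sample variants listed above differ only in
the divisor (n versus n - 1). A small Python sketch using the standard
library's statistics module shows the relationship. Mapping the MySQL names
onto these Python functions is an assumption made for illustration and is not
part of the issue.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_pop = statistics.pvariance(data)    # like VAR_POP / VARIANCE: divide by n
var_samp = statistics.variance(data)    # like VAR_SAMP: divide by n - 1
stddev_pop = statistics.pstdev(data)    # like STD / STDDEV / STDDEV_POP
stddev_samp = statistics.stdev(data)    # like STDDEV_SAMP

print(var_pop, var_samp, stddev_pop, stddev_samp)
```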
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.