[ 
https://issues.apache.org/jira/browse/HIVE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Lahiri updated HIVE-1372:
--------------------------------

    Attachment: HIVE-1372.3.patch

AFAIK, this is a floating-point rounding error. I ran some tests on millions of 
large random doubles, and the differences are consistently confined to the last 
few significant digits. Curiously, even the unmodified sum() UDAF differs from 
R's output in the last few digits when operating on large-ish synthetic data. 
That leads me to believe that either Hive or Java's default println is printing 
a few more digits than are actually significant, or that Java's floating-point 
rounding is somehow quirky.
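
To illustrate the kind of last-digit difference described above, here is a 
minimal Java sketch. It is not taken from the patch or from the Hive test 
suite; the class and method names are hypothetical. It only shows that adding 
the same doubles in a different order can change the final digit of the sum. 
It may also be worth checking how the two outputs are printed: Java's 
Double.toString emits as many digits as are needed to identify the double 
uniquely, while R's print shows 7 significant digits by default, so the two 
tools may be showing the same value at different precision.

```java
// Hypothetical demo class, not part of Hive.
final class SumOrderDemo {

    // Accumulate left to right, as a simple running sum would.
    static double sumForward(double[] xs) {
        double s = 0.0;
        for (double x : xs) {
            s += x;
        }
        return s;
    }

    // Accumulate right to left; same values, different rounding steps.
    static double sumBackward(double[] xs) {
        double s = 0.0;
        for (int i = xs.length - 1; i >= 0; i--) {
            s += xs[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] xs = {0.1, 0.2, 0.3};
        System.out.println(sumForward(xs));   // prints 0.6000000000000001
        System.out.println(sumBackward(xs));  // prints 0.6
    }
}
```

Any aggregation that combines partial results in a data-dependent order can 
show the same effect, which would be consistent with differences that are 
limited to the last few digits.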

I've corrected the two .q.out files and attached the patch.

> New algorithm for variance() UDAF
> ---------------------------------
>
>                 Key: HIVE-1372
>                 URL: https://issues.apache.org/jira/browse/HIVE-1372
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>            Priority: Minor
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1372.2.patch, HIVE-1372.3.patch, HIVE-1372.patch
>
>
> A new algorithm for the UDAF that computes variance. This is pretty much a 
> drop-in replacement for the current UDAF, and has two benefits: it is 
> provably numerically stable (reference included in comments), and it reduces 
> arithmetic operations by about half.
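
For readers unfamiliar with numerically stable variance computation, the 
sketch below shows one well-known approach: a Welford-style running update 
with a pairwise merge step. This is an illustration only and is not 
necessarily the algorithm used in the attached patch; the class name and 
methods are hypothetical and do not correspond to Hive's UDAF interfaces. It 
computes the population variance (dividing by n).

```java
// Hypothetical accumulator, not Hive code.
final class RunningVariance {
    private long n;       // number of values seen
    private double mean;  // running mean
    private double m2;    // sum of squared deviations from the mean

    // Fold one value into the running state (Welford-style update).
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Combine another partial state into this one, as a merge phase would.
    void merge(RunningVariance other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    // Population variance; NaN when no values have been added.
    double populationVariance() {
        return n == 0 ? Double.NaN : m2 / n;
    }

    public static void main(String[] args) {
        RunningVariance rv = new RunningVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            rv.add(x);
        }
        System.out.println(rv.populationVariance());
    }
}
```

The point of this formulation is that it never subtracts two large, nearly 
equal sums of squares, which is where the textbook sum-of-squares formula 
loses precision on data with a large mean.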

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
