[ https://issues.apache.org/jira/browse/HIVE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mayank Lahiri updated HIVE-1372:
--------------------------------
Attachment: HIVE-1372.3.patch
As far as I can tell, this is a floating-point rounding error. I ran tests on
millions of large random doubles, and the differences are consistently confined
to the last few significant digits. Curiously, even the unmodified sum() UDAF
differs from R's output in the last few digits on large-ish synthetic data.
This leads me to believe that either Hive or Java's default println is printing
a few more digits than it should, or Java's floating-point handling is somehow
quirky in terms of rounding.
I've corrected the two .q.out files and attached the patch.
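As a small illustration of how last-digit differences like these can arise, the sketch below adds the same three doubles in two different orders. It is not taken from the patch or from Hive; the class and method names are hypothetical. Whether accumulation order is what actually separates Hive's output from R's is not established in this thread. It is only one plausible contributor, since partial aggregates may be combined in a different order than a single sequential pass.

```java
public class SumOrderDemo {
    // Hypothetical helper: sums the array from first element to last.
    static double sumForward(double[] xs) {
        double s = 0.0;
        for (double x : xs) {
            s += x;
        }
        return s;
    }

    // Hypothetical helper: sums the same array from last element to first.
    static double sumBackward(double[] xs) {
        double s = 0.0;
        for (int i = xs.length - 1; i >= 0; i--) {
            s += xs[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] xs = {0.1, 0.2, 0.3};
        System.out.println(sumForward(xs));   // prints 0.6000000000000001
        System.out.println(sumBackward(xs));  // prints 0.6
    }
}
```

Both results are correctly rounded at every step; they differ only because double addition is not associative.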
> New algorithm for variance() UDAF
> ---------------------------------
>
> Key: HIVE-1372
> URL: https://issues.apache.org/jira/browse/HIVE-1372
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Mayank Lahiri
> Assignee: Mayank Lahiri
> Priority: Minor
> Fix For: 0.6.0
>
> Attachments: HIVE-1372.2.patch, HIVE-1372.3.patch, HIVE-1372.patch
>
>
> A new algorithm for the UDAF that computes variance. It is pretty much a
> drop-in replacement for the current UDAF and has two benefits: it is provably
> numerically stable (reference included in comments), and it reduces arithmetic
> operations by about half.
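The description does not name the algorithm, so the following is only a sketch of one well-known numerically stable approach: Welford's streaming update, with a pairwise merge step for combining partial aggregates. It is an assumption for illustration, not necessarily what the attached patch implements. The class RunningVariance and its methods are hypothetical names and are not part of Hive's UDAF API.

```java
public class RunningVariance {
    private long count = 0;    // number of values seen so far
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    // Welford's update: fold one value into the running state.
    public void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    // Combine another partial aggregate into this one (pairwise merge).
    public void merge(RunningVariance other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long n = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / n);
        mean += delta * other.count / n;
        count = n;
    }

    // Population variance; NaN for an empty aggregate (a choice made for this sketch).
    public double variance() {
        return count == 0 ? Double.NaN : m2 / count;
    }

    public static void main(String[] args) {
        RunningVariance v = new RunningVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            v.add(x);
        }
        System.out.println(v.variance()); // approximately 4.0
    }
}
```

The merge step matters in a setting like Hive's, where each task would build its own partial state and the partial states are combined afterwards. Keeping a running mean and a sum of squared deviations avoids subtracting two large, nearly equal quantities, which is where the textbook sum-of-squares formula loses precision.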