[ 
https://issues.apache.org/jira/browse/IMPALA-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755492#comment-17755492
 ] 

Daniel Becker commented on IMPALA-12347:
----------------------------------------

Nice catch [~pranav.lodha] and [~stigahuang].

In addition to floating point rounding errors, I presume this error could also 
happen with very big numbers near the maximum value of the floating point type: 
the aggregate value may become INF and we can never get back to finite values 
even if the subsequent numbers are small. One way to handle it would be to 
switch to the Hive pattern (at least temporarily) when we see infinite or NaN 
values.
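
As a minimal standalone sketch of this point (plain C++ with made-up values, not Impala code): once a running double aggregate overflows to +inf, the remove step cannot undo it, because inf minus any finite number is still inf.

{code:cpp}
// Standalone sketch: a running sum that overflows to +inf cannot be brought
// back to a finite value by "removing" the value that caused the overflow.
#include <cstdio>
#include <limits>

int main() {
  double sum = std::numeric_limits<double>::max();   // window already holds a huge value
  double next = std::numeric_limits<double>::max();  // another huge value enters the window
  sum += next;                                       // overflows to +inf
  std::printf("after add:    %g\n", sum);            // inf
  sum -= next;                                       // the row expires; the remove step subtracts it
  std::printf("after remove: %g\n", sum);            // still inf (inf - finite == inf)
  return 0;
}
{code}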

We should note that switching to the Hive pattern is only useful for imprecise 
types, i.e. FLOAT and DOUBLE. For other types, such as integers, the operations 
are precise _except_ for signed overflow, which is undefined behaviour, so the 
Hive pattern wouldn't help (once we've run into undefined behaviour, the whole 
program's behaviour is undefined).
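
For contrast, a small sketch (again plain C++ with illustrative values of my own, not Impala state) showing that an integer running sum round-trips exactly through add/remove as long as it stays in range, while a double running sum can silently lose the contribution of a small row:

{code:cpp}
// Standalone sketch: integer add/remove is exact (absent overflow), while the
// same add/remove sequence on doubles can lose information.
#include <cstdint>
#include <cstdio>

int main() {
  // Integers: adding and later removing a value restores the state exactly.
  int64_t isum = 1000000000000000000LL;  // 1e18, well within int64 range
  isum += 7;
  isum -= 7;
  std::printf("int64 restored exactly: %lld\n", static_cast<long long>(isum));

  // Doubles: above 2^53 the value 1e16 + 1 is not representable, so the small
  // addend is absorbed by rounding and the later "remove" cannot recover it.
  double dsum = 1e16;
  dsum += 1.0;   // rounds back to 1e16
  dsum -= 1e16;  // remove the big value; the contribution of the 1.0 row is gone
  std::printf("double leftover: %g (expected 1)\n", dsum);  // prints 0
  return 0;
}
{code}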

The new behaviour could be put behind a query option because:
* if someone relies on the old behaviour, they can get it back
* if someone needs correct (or more precise) results even for large windows, 
they can have them in exchange for worse performance
* debugging the two patterns (Hive vs. old Impala) is easier if we can control 
directly which one is used.

> Cumulated floating point error in window functions
> --------------------------------------------------
>
>                 Key: IMPALA-12347
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12347
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Quanlong Huang
>            Priority: Major
>
> During the development of IMPALA-11957, [~pranav.lodha] found that the following 
> query returns a different result than Hive:
> {code:sql}
> select s_store_sk,
>   regr_slope(s_number_employees, s_floor_space)
>   over (partition by s_city order by s_store_sk
>     rows between 1 preceding and 1 following)
> from tpcds.store;{code}
> The following query is simpler but can still reproduce the difference:
> {code:sql}
> select regr_slope(a, b) over (order by b rows between 1 preceding and 1 
> following)
> from (values (271 a, 6995995 b), (294, 9294113), (294, 9294113)) v;{code}
> The results in Hive (correct):
> {noformat}
> +------------------------+
> |  regr_slope_window_0   |
> +------------------------+
> | 1.0008189309687318E-5  |
> | 1.0008189309687323E-5  |
> | NULL                   |
> +------------------------+ {noformat}
> The results in Impala (last line is wrong):
> {noformat}
> +----------------------------+
> | regr_slope(a, b) OVER(...) |
> +----------------------------+
> | 1.00081893097e-05          |
> | 1.00081893097e-05          |
> | 2.13623046875e-05          |
> +----------------------------+{noformat}
> The last two points are identical, so the slope should be NULL.
> The difference is due to accumulated floating point error in Impala. The 
> intermediate state of the regression functions consists of double values, which 
> accumulate more error the more computation we perform on them.
> In Impala, each analytic function has a remove method (remove_fn_) to deal 
> with rows that expire when the window slides, and an update method to add new 
> rows. In Hive, however, analytic functions don't need remove methods: each 
> time the window slides, Hive calculates the analytic function by iterating 
> over all rows in the window from scratch:
> [https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L205-L207]
> [https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L230-L240]
> The implementation in Hive accumulates less floating point computation error, 
> since for each row's value the computation only involves the rows inside 
> the window. In Impala, by contrast, to get each row's value we need to 
> invoke the remove method to update the intermediate state, then invoke the 
> update method to add the current row. The intermediate state accumulates the 
> floating point computation error.
> For evaluating analytic functions over small window sizes, maybe we should 
> switch to Hive's pattern to get higher precision.
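
To illustrate the mechanism described above, here is a standalone C++ sketch that contrasts the two evaluation patterns on the data from the simpler repro query. It assumes a Welford-style running mean/co-moment as the intermediate state, which is only an approximation of Impala's actual regr_slope aggregate, so it won't reproduce Impala's exact output; it does show how remove-then-update typically leaves a tiny nonzero variance where recomputing the window from scratch yields an exact zero (and hence NULL).

{code:cpp}
// Sketch only: Welford-style running state for regr_slope(y, x), not Impala's
// actual implementation. Data is taken from the repro query above.
#include <cstdio>

struct RegrSlopeState {
  double n = 0, mean_x = 0, mean_y = 0, m2_x = 0, c_xy = 0;

  void Update(double y, double x) {  // add one row to the window
    n += 1;
    double dx = x - mean_x;
    mean_x += dx / n;
    mean_y += (y - mean_y) / n;
    m2_x += dx * (x - mean_x);       // running sum of (x - avg(x))^2
    c_xy += dx * (y - mean_y);       // running sum of (x - avg(x))(y - avg(y))
  }

  void Remove(double y, double x) {  // expire one row (the old Impala pattern)
    double old_mean_x = mean_x, old_mean_y = mean_y;
    n -= 1;
    mean_x = (old_mean_x * (n + 1) - x) / n;
    mean_y = (old_mean_y * (n + 1) - y) / n;
    m2_x -= (x - mean_x) * (x - old_mean_x);
    c_xy -= (x - mean_x) * (y - old_mean_y);
  }

  // regr_slope = covar_pop(y, x) / var_pop(x); 0/0 gives NaN, i.e. NULL in SQL.
  double Slope() const { return c_xy / m2_x; }
};

int main() {
  const double a[] = {271, 294, 294};              // y
  const double b[] = {6995995, 9294113, 9294113};  // x

  // Remove/update pattern: build the state for rows 1..3, then expire row 1.
  RegrSlopeState running;
  for (int i = 0; i < 3; ++i) running.Update(a[i], b[i]);
  running.Remove(a[0], b[0]);
  // Typically prints a small nonzero garbage slope instead of NULL.
  std::printf("remove/update pattern:  %.11e\n", running.Slope());

  // Hive pattern: recompute the window of the last row (rows 2..3) from scratch.
  RegrSlopeState scratch;
  scratch.Update(a[1], b[1]);
  scratch.Update(a[2], b[2]);
  // var_pop(b) is exactly 0 here, so 0/0 -> nan, which maps to NULL.
  std::printf("recompute from scratch: %g\n", scratch.Slope());
  return 0;
}
{code}

With the remove/update pattern the denominator var_pop(b) ends up as a rounding residue instead of exactly zero, so the division produces a finite garbage slope; the from-scratch state divides 0 by 0 and can be mapped to NULL.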



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to