Quanlong Huang created IMPALA-12347:
---------------------------------------
Summary: Cumulated floating point error in window functions
Key: IMPALA-12347
URL: https://issues.apache.org/jira/browse/IMPALA-12347
Project: IMPALA
Issue Type: Bug
Components: Backend
Reporter: Quanlong Huang
During the development of IMPALA-11957, [~pranav.lodha] found that the following query
produces a different result than Hive:
{code:sql}
select s_store_sk,
regr_slope(s_number_employees, s_floor_space)
over (partition by s_city order by s_store_sk
rows between 1 preceding and 1 following)
from tpcds.store;{code}
The following query is simpler but can still reproduce the difference:
{code:sql}
select regr_slope(a, b) over (order by b rows between 1 preceding and 1
following)
from (values (271 a, 6995995 b), (294, 9294113), (294, 9294113)) v;{code}
The results in Hive (correct):
{noformat}
+------------------------+
| regr_slope_window_0 |
+------------------------+
| 1.0008189309687318E-5 |
| 1.0008189309687323E-5 |
| NULL |
+------------------------+ {noformat}
The results in Impala (last line is wrong):
{noformat}
+----------------------------+
| regr_slope(a, b) OVER(...) |
+----------------------------+
| 1.00081893097e-05 |
| 1.00081893097e-05 |
| 2.13623046875e-05 |
+----------------------------+{noformat}
The last two points are identical, so the slope for the last window should be NULL.
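To see why, recall that regr_slope(y, x) is covar_pop(y, x) / var_pop(x), and the result is NULL when var_pop(x) is zero. The last window holds only the two identical rows, so the population variance of b (the independent variable) is exactly zero:

{code:python}
# The last window contains only the two identical rows, so the
# population variance of b is exactly 0 and regr_slope must be NULL.
bs = [9294113.0, 9294113.0]
mean_b = sum(bs) / len(bs)
var_pop_b = sum((b - mean_b) ** 2 for b in bs) / len(bs)
print(var_pop_b)  # 0.0 -> regr_slope is NULL
{code}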
The difference is due to accumulated floating point error in Impala. The
intermediate state of the regression functions consists of double values, which
accumulate more error as more computation is performed.
In Impala, each analytic function has a remove method (remove_fn_) to handle
rows expiring as the window slides, plus an update method to add new rows.
Hive's analytic functions, by contrast, don't need remove methods: each time
the window slides, Hive recalculates the analytic function from scratch by
iterating over all rows in the window:
[https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L205-L207]
[https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L230-L240]
Hive's implementation incurs less floating point error since, for each row's
value, the computation touches only the rows currently inside the window. In
Impala, by contrast, to produce each row's value we invoke the remove method on
the expiring row to update the intermediate state, then invoke the update
method to add the current row, so the intermediate state accumulates floating
point error across the whole partition.
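To illustrate, here is a Python sketch (not Impala's actual C++ code) contrasting the two strategies on the reproducing data. The Welford-style incremental state below is an assumption standing in for Impala's update_fn_/remove_fn_ pair; the exact garbage value it produces differs from Impala's 2.13623046875e-05, but the mechanism is the same: the rounded remove step does not exactly undo the rounded update step, so the residual variance is not exactly zero and the slope is not NULL.

{code:python}
# Sketch only: a Welford-style incremental state (an assumed stand-in
# for Impala's update_fn_/remove_fn_ pair) versus Hive's from-scratch
# recomputation.

class RegrSlopeState:
    """Incremental state: update() on entering rows, remove() on expiring ones."""
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2x = 0.0  # sum of squared deviations of x
        self.cxy = 0.0  # co-moment of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean
        self.mean_x += dx / self.n
        dy = y - self.mean_y
        self.mean_y += dy / self.n
        self.m2x += dx * (x - self.mean_x)
        self.cxy += dx * (y - self.mean_y)

    def remove(self, x, y):
        # Algebraic inverse of update(); every step rounds, so the
        # rounding errors left behind do not cancel exactly.
        dx = x - self.mean_x          # deviation from the pre-removal mean
        dy = y - self.mean_y
        self.n -= 1
        self.mean_x -= dx / self.n
        self.mean_y -= dy / self.n
        self.m2x -= (x - self.mean_x) * dx
        self.cxy -= (x - self.mean_x) * dy

    def slope(self):
        if self.n < 2 or self.m2x == 0.0:
            return None               # NULL
        return self.cxy / self.m2x

def regr_slope_from_scratch(rows):
    """Hive-style: iterate all rows currently inside the window."""
    n = len(rows)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    m2x = sum((x - mean_x) ** 2 for x, _ in rows)
    if m2x == 0.0:
        return None                   # zero variance in x -> NULL
    cxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    return cxy / m2x

# (b, a) pairs from the reproducing query; b is the independent variable.
rows = [(6995995.0, 271.0), (9294113.0, 294.0), (9294113.0, 294.0)]

state = RegrSlopeState()
state.update(*rows[0])
state.update(*rows[1])  # window = rows 1..2
state.update(*rows[2])  # window = rows 1..3
state.remove(*rows[0])  # window = rows 2..3: both b values identical

print(regr_slope_from_scratch(rows[1:3]))  # None (NULL), the correct answer
print(state.m2x)      # tiny nonzero residual instead of exactly 0.0
print(state.slope())  # a wrong nonzero slope instead of None
{code}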
For evaluating analytic functions over small windows, maybe we should switch
to Hive's pattern to achieve higher precision.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)