Quanlong Huang created IMPALA-12347:
---------------------------------------

             Summary: Cumulated floating point error in window functions
                 Key: IMPALA-12347
                 URL: https://issues.apache.org/jira/browse/IMPALA-12347
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Quanlong Huang


During the development of IMPALA-11957, [~pranav.lodha] found that the following query 
produces a different result than Hive:
{code:sql}
select s_store_sk,
  regr_slope(s_number_employees, s_floor_space)
  over (partition by s_city order by s_store_sk
    rows between 1 preceding and 1 following)
from tpcds.store;{code}
The following query is simpler but can still reproduce the difference:
{code:sql}
select regr_slope(a, b) over (order by b rows between 1 preceding and 1 
following)
from (values (271 a, 6995995 b), (294, 9294113), (294, 9294113)) v;{code}
The results in Hive (correct):
{noformat}
+------------------------+
|  regr_slope_window_0   |
+------------------------+
| 1.0008189309687318E-5  |
| 1.0008189309687323E-5  |
| NULL                   |
+------------------------+ {noformat}
The results in Impala (last line is wrong):
{noformat}
+----------------------------+
| regr_slope(a, b) OVER(...) |
+----------------------------+
| 1.00081893097e-05          |
| 1.00081893097e-05          |
| 2.13623046875e-05          |
+----------------------------+{noformat}
The last two points are identical, so the slope of the final window should be NULL.

The difference is due to accumulated floating point error in Impala. The 
intermediate state of the regression functions consists of double values, which 
accumulate more rounding error as more computation is performed on them.
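Concretely, regr_slope(y, x) is covar_pop(y, x) / var_pop(x), and is NULL when var_pop(x) is zero. A minimal from-scratch evaluation (a sketch in Python for illustration, not Impala's or Hive's actual code) shows why the last window must be NULL:

```python
def regr_slope(rows):
    """rows is a list of (y, x) pairs; returns covar_pop(y, x) / var_pop(x),
    or None (SQL NULL) when var_pop(x) is zero."""
    n = len(rows)
    mean_x = sum(x for _, x in rows) / n
    mean_y = sum(y for y, _ in rows) / n
    var_x = sum((x - mean_x) ** 2 for _, x in rows)
    covar = sum((x - mean_x) * (y - mean_y) for y, x in rows)
    return None if var_x == 0.0 else covar / var_x

# First window of the reproducer: {(271, 6995995), (294, 9294113)}
print(regr_slope([(271, 6995995), (294, 9294113)]))  # ~1.0008189e-05
# Last window: both x values are identical, so var_pop(x) == 0
print(regr_slope([(294, 9294113), (294, 9294113)]))  # None (NULL)
```

With only two distinct points, the slope degenerates to delta_y / delta_x = 23 / 2298118, matching Hive's first two rows; with two identical points the denominator is exactly zero and the result must be NULL.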

In Impala, each analytic function has a remove method (remove_fn_) that handles 
rows expiring from the window as it slides, and an update method that adds new 
rows. Hive's analytic functions, by contrast, need no remove method: each time 
the window slides, Hive evaluates the analytic function from scratch by 
iterating over all rows in the window:
[https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L205-L207]
[https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L230-L240]

Hive's implementation accumulates less floating point error, since the value 
for each row is computed only from the rows inside that row's window. In 
Impala, however, producing the value for each row requires invoking the remove 
method to expire rows from the intermediate state and then the update method to 
add the current row, so the intermediate state accumulates floating point error 
across the whole partition.

For analytic functions evaluated over small windows, perhaps we should switch 
to Hive's pattern to achieve higher precision.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
