Abhishek Gupta created PHOENIX-4899:
---------------------------------------
Summary: Incorrect behaviour of LAST_VALUE aggregation UDF
Key: PHOENIX-4899
URL: https://issues.apache.org/jira/browse/PHOENIX-4899
Project: Phoenix
Issue Type: Bug
Affects Versions: 4.7.0
Environment: Phoenix version: 4.7.0
HBase version: 1.1.2
Reporter: Abhishek Gupta
The LAST_VALUE aggregation UDF doesn't give correct result when group by
columns does not include all primary key columns.
The function works in a way that it ignores all the group by columns except the
column in the order by clause in finding the last value. For example
Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and
m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value on
m1
{quote}pk1 . | pk2 | pk3 . | pk4 | pk5 .
| m1 | m2 | m3
------------------------------------------------------------------------------------------------------
| test | 201808010000 | app1 | plat1 | lang1 | 1
| 10 | 100 |
| test | 201808010000 | app1 | plat1 | lang2 | 2
| 10 | 100 |
| test | 201808010000 | app1 | plat2 | lang1 | 3
| 10 | 100 |
| test | 201808010000 | app2 | plat1 | lang1 | 4
| 10 | 100 |
| test | 201808010030 | app1 | plat1 | lang1 | 10
| 10 | 100 |
| test | 201808010030 | app1 | plat1 | lang2 | 20
| 10 | 100 |
| test | 201808010030 | app1 | plat2 | lang1 | 30
| 10 | 100 |
| test | 201808010030 | app2 | plat1 | lang1 | 40
| 10 | 100 |
| test | 201808010100 | app1 | plat1 | lang1 | 100
| 10 | 100 |
| test | 201808010100 | app1 | plat1 | lang2 | 200
| 10 | 100 |
| test | 201808010100 | app1 | plat2 | lang1 | 300
| 10 | 100 |
| test | 201808010100 | app2 | plat1 | lang1 | 400
| 10 | 100 |
| test | 201808010130 | app1 | plat1 | lang1 | 1000
| 10 | 100 |
| test | 201808010130 | app1 | plat1 | lang2 | 2000
| 10 | 100 |
| test | 201808010130 | app1 | plat2 | lang1 | 3000
| 10 | 100 |
| test | 201808010130 | app2 | plat1 | lang1 | 4000
| 10 | 100 |
| test | 201808010200 | app1 | plat1 | lang1 | 10000
| 10 | 100 |
| test | 201808010200 | app1 | plat1 | lang2 | 20000
| 10 | 100 |
| test | 201808010200 | app1 | plat2 | lang1 | 30000
| 10 | 100 |
| test | 201808010200 | app2 | plat1 | lang1 | 40000
| 10 | 100 |{quote}
If I run the following query (using all primary key columns in group by)
{quote}select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as
t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi, sum(m2),
sum(m3) from test_table group by pk1,t,pk3,pk4,pk5;{quote}
I get the correct result for last value as for each pk2 value which is used in
order by there is only 1 row
{quote}+-------------+---------+----------------+--------+--------+----------+----------+
| T | pk3 | pk4 | pk5 | OI | SUM(M2) | SUM(M3) |
+-------------+---------+----------------+--------+--------+----------+----------+
| 2018080100 | app1 | plat1 | lang1 | 10 | 20 | 200
|
| 2018080100 | app1 | plat1 | lang2 | 20 | 20 | 200
|
| 2018080100 | app1 | plat2 | lang1 | 30 | 20 | 200
|
| 2018080100 | app2 | plat1 | lang1 | 40 | 20 | 200
|
| 2018080101 | app1 | plat1 | lang1 | 1000 | 20 | 200
|
| 2018080101 | app1 | plat1 | lang2 | 2000 | 20 | 200
|
| 2018080101 | app1 | plat2 | lang1 | 3000 | 20 | 200
|
| 2018080101 | app2 | plat1 | lang1 | 4000 | 20 | 200
|
| 2018080102 | app1 | plat1 | lang1 | 10000 | 20 | 100
|
| 2018080102 | app1 | plat1 | lang2 | 20000 | 10 | 100
|
| 2018080102 | app1 | plat2 | lang1 | 30000 | 10 | 100
|
| 2018080102 | app2 | plat1 | lang1 | 40000 | 10 | 100
|
+-------------+---------+----------------+--------+--------+----------+----------+{quote}
However if I do I group by on less than all the primary columns the LAST_VALUE
function ignores the rest of the group by columns in sorting and returns
incorrect last_value
{quote}select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3,
last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from
test_table group by pk1,t,pk3;{quote}
{quote}+-------------+---------+--------+----------+----------+
| T | pk3 | OI | SUM(M2) | SUM(M3) |
+-------------+---------+--------+----------+----------+
| 2018080100 | app1 | 10 | 60 | 600 |
| 2018080100 | app2 | 40 | 20 | 200 |
| 2018080101 | app1 | 1000 | 60 | 600 |
| 2018080101 | app2 | 4000 | 20 | 200 |
| 2018080102 | app1 | 10000 | 40 | 300 |
| 2018080102 | app2 | 40000 | 10 | 100 |
+-------------+---------+--------+----------+----------+ {quote}
So instead of taking the last value of the group formed by 2018080100 and app1
i.e 30 it is picking the first i.e 10.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)