[
https://issues.apache.org/jira/browse/PHOENIX-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhishek Gupta updated PHOENIX-4899:
------------------------------------
Labels: Aggregation UDF aggregator udf udfs (was: )
> Incorrect behaviour of LAST_VALUE aggregation UDF
> -------------------------------------------------
>
> Key: PHOENIX-4899
> URL: https://issues.apache.org/jira/browse/PHOENIX-4899
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.7.0
> Environment: Phoenix version: 4.7.0
> HBase version: 1.1.2
> Reporter: Abhishek Gupta
> Priority: Critical
> Labels: Aggregation, UDF, aggregator, udf, udfs
>
>
> The LAST_VALUE aggregation UDF doesn't give correct result when group by
> columns does not include all primary key columns.
> The function works in a way that it ignores all the group by columns except
> the column in the order by clause in finding the last value. For example
>
> Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and
> m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value
> on m1
>
> {quote}pk1 . | pk2 | pk3 . | pk4 | pk5
> . | m1 | m2 | m3
> ------------------------------------------------------------------------------------------------------
> | test | 201808010000 | app1 | plat1 | lang1 | 1
> | 10 | 100 |
> | test | 201808010000 | app1 | plat1 | lang2 | 2
> | 10 | 100 |
> | test | 201808010000 | app1 | plat2 | lang1 | 3
> | 10 | 100 |
> | test | 201808010000 | app2 | plat1 | lang1 | 4
> | 10 | 100 |
> | test | 201808010030 | app1 | plat1 | lang1 | 10
> | 10 | 100 |
> | test | 201808010030 | app1 | plat1 | lang2 | 20
> | 10 | 100 |
> | test | 201808010030 | app1 | plat2 | lang1 | 30
> | 10 | 100 |
> | test | 201808010030 | app2 | plat1 | lang1 | 40
> | 10 | 100 |
> | test | 201808010100 | app1 | plat1 | lang1 | 100
> | 10 | 100 |
> | test | 201808010100 | app1 | plat1 | lang2 | 200
> | 10 | 100 |
> | test | 201808010100 | app1 | plat2 | lang1 | 300
> | 10 | 100 |
> | test | 201808010100 | app2 | plat1 | lang1 | 400
> | 10 | 100 |
> | test | 201808010130 | app1 | plat1 | lang1 | 1000
> | 10 | 100 |
> | test | 201808010130 | app1 | plat1 | lang2 | 2000
> | 10 | 100 |
> | test | 201808010130 | app1 | plat2 | lang1 | 3000
> | 10 | 100 |
> | test | 201808010130 | app2 | plat1 | lang1 | 4000
> | 10 | 100 |
> | test | 201808010200 | app1 | plat1 | lang1 | 10000
> | 10 | 100 |
> | test | 201808010200 | app1 | plat1 | lang2 | 20000
> | 10 | 100 |
> | test | 201808010200 | app1 | plat2 | lang1 | 30000
> | 10 | 100 |
> | test | 201808010200 | app2 | plat1 | lang1 | 40000
> | 10 | 100 |{quote}
>
> If I run the following query (using all primary key columns in group by)
>
> {quote}select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as
> t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi, sum(m2),
> sum(m3) from test_table group by pk1,t,pk3,pk4,pk5;{quote}
>
> I get the correct result for last value as for each pk2 value which is used
> in order by there is only 1 row
> {quote}+-------------+---------+----------------+--------+--------+----------+----------+
> | T | pk3 | pk4 | pk5 | OI | SUM(M2) | SUM(M3) |
> +-------------+---------+----------------+--------+--------+----------+----------+
> | 2018080100 | app1 | plat1 | lang1 | 10 | 20 | 200
> |
> | 2018080100 | app1 | plat1 | lang2 | 20 | 20 | 200
> |
> | 2018080100 | app1 | plat2 | lang1 | 30 | 20 | 200
> |
> | 2018080100 | app2 | plat1 | lang1 | 40 | 20 | 200
> |
> | 2018080101 | app1 | plat1 | lang1 | 1000 | 20 | 200
> |
> | 2018080101 | app1 | plat1 | lang2 | 2000 | 20 | 200
> |
> | 2018080101 | app1 | plat2 | lang1 | 3000 | 20 | 200
> |
> | 2018080101 | app2 | plat1 | lang1 | 4000 | 20 | 200
> |
> | 2018080102 | app1 | plat1 | lang1 | 10000 | 20 | 100
> |
> | 2018080102 | app1 | plat1 | lang2 | 20000 | 10 | 100
> |
> | 2018080102 | app1 | plat2 | lang1 | 30000 | 10 | 100
> |
> | 2018080102 | app2 | plat1 | lang1 | 40000 | 10 | 100
> |
> +-------------+---------+----------------+--------+--------+----------+----------+{quote}
>
> However if I do I group by on less than all the primary columns the
> LAST_VALUE function ignores the rest of the group by columns in sorting and
> returns incorrect last_value
>
> {quote}select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3,
> last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from
> test_table group by pk1,t,pk3;{quote}
>
> {quote}+-------------+---------+--------+----------+----------+
> | T | pk3 | OI | SUM(M2) | SUM(M3) |
> +-------------+---------+--------+----------+----------+
> | 2018080100 | app1 | 10 | 60 | 600 |
> | 2018080100 | app2 | 40 | 20 | 200 |
> | 2018080101 | app1 | 1000 | 60 | 600 |
> | 2018080101 | app2 | 4000 | 20 | 200 |
> | 2018080102 | app1 | 10000 | 40 | 300 |
> | 2018080102 | app2 | 40000 | 10 | 100 |
> +-------------+---------+--------+----------+----------+ {quote}
>
> So instead of taking the last value of the group formed by 2018080100 and
> app1 i.e 30 it is picking the first i.e 10.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)