[jira] [Updated] (PHOENIX-4899) Incorrect behaviour of LAST_VALUE aggregation UDF

Abhishek Gupta (JIRA) Tue, 11 Sep 2018 07:07:26 -0700


     [ 
https://issues.apache.org/jira/browse/PHOENIX-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Abhishek Gupta updated PHOENIX-4899:
------------------------------------
    Labels: Aggregation UDF aggregator udf udfs  (was: )

> Incorrect behaviour of LAST_VALUE aggregation UDF
> -------------------------------------------------
>
>                 Key: PHOENIX-4899
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4899
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.7.0
>         Environment: Phoenix version: 4.7.0
> HBase version: 1.1.2
>            Reporter: Abhishek Gupta
>            Priority: Critical
>              Labels: Aggregation, UDF, aggregator, udf, udfs
>
>  
> The LAST_VALUE aggregation UDF doesn't give correct result when group by 
> columns does not include all primary key columns. 
> The function works in a way that it ignores all the group by columns except 
> the column in the order by clause in finding the last value. For example
>  
> Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and 
> m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value 
> on m1 
>  
> {quote}pk1 .   | pk2                       | pk3 .    | pk4            | pk5 
> .   | m1         | m2 | m3
> ------------------------------------------------------------------------------------------------------
> | test    | 201808010000    | app1    | plat1          | lang1  | 1           
>  | 10  | 100  |
> | test    | 201808010000    | app1    | plat1          | lang2  | 2           
>  | 10  | 100  |
> | test    | 201808010000    | app1    | plat2          | lang1  | 3           
>  | 10  | 100  |
> | test    | 201808010000    | app2    | plat1          | lang1  | 4           
>  | 10  | 100  |
> | test    | 201808010030    | app1    | plat1          | lang1  | 10          
>  | 10  | 100  |
> | test    | 201808010030    | app1    | plat1          | lang2  | 20          
>  | 10  | 100  |
> | test    | 201808010030    | app1    | plat2          | lang1  | 30          
>  | 10  | 100  |
> | test    | 201808010030    | app2    | plat1          | lang1  | 40          
>  | 10  | 100  |
> | test    | 201808010100    | app1    | plat1          | lang1  | 100         
>  | 10  | 100  |
> | test    | 201808010100    | app1    | plat1          | lang2  | 200         
>  | 10  | 100  |
> | test    | 201808010100    | app1    | plat2          | lang1  | 300         
>  | 10  | 100  |
> | test    | 201808010100    | app2    | plat1          | lang1  | 400         
>  | 10  | 100  |
> | test    | 201808010130    | app1    | plat1          | lang1  | 1000        
>  | 10  | 100  |
> | test    | 201808010130    | app1    | plat1          | lang2  | 2000        
>  | 10  | 100  |
> | test    | 201808010130    | app1    | plat2          | lang1  | 3000        
>  | 10  | 100  |
> | test    | 201808010130    | app2    | plat1          | lang1  | 4000        
>  | 10  | 100  |
> | test    | 201808010200    | app1    | plat1          | lang1  | 10000       
>  | 10  | 100  |
> | test    | 201808010200    | app1    | plat1          | lang2  | 20000       
>  | 10  | 100  |
> | test    | 201808010200    | app1    | plat2          | lang1  | 30000       
>  | 10  | 100  |
> | test    | 201808010200    | app2    | plat1          | lang1  | 40000       
>  | 10  | 100  |{quote}
>  
> If I run the following query (using all primary key columns in group by)
>  
> {quote}select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as 
> t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi, sum(m2), 
> sum(m3) from test_table group by pk1,t,pk3,pk4,pk5;{quote}
>  
> I get the correct result for last value as for each pk2 value which is used 
> in order by there is only 1 row
> {quote}+-------------+---------+----------------+--------+--------+----------+----------+
> |      T      | pk3  | pk4  |  pk5  |   OI   | SUM(M2)  | SUM(M3)  |
> +-------------+---------+----------------+--------+--------+----------+----------+
> | 2018080100  | app1    | plat1          | lang1  | 10     | 20       | 200   
>    |
> | 2018080100  | app1    | plat1          | lang2  | 20     | 20       | 200   
>    |
> | 2018080100  | app1    | plat2          | lang1  | 30     | 20       | 200   
>    |
> | 2018080100  | app2    | plat1          | lang1  | 40     | 20       | 200   
>    |
> | 2018080101  | app1    | plat1          | lang1  | 1000   | 20       | 200   
>    |
> | 2018080101  | app1    | plat1          | lang2  | 2000   | 20       | 200   
>    |
> | 2018080101  | app1    | plat2          | lang1  | 3000   | 20       | 200   
>    |
> | 2018080101  | app2    | plat1          | lang1  | 4000   | 20       | 200   
>    |
> | 2018080102  | app1    | plat1          | lang1  | 10000  | 20       | 100   
>    |
> | 2018080102  | app1    | plat1          | lang2  | 20000  | 10       | 100   
>    |
> | 2018080102  | app1    | plat2          | lang1  | 30000  | 10       | 100   
>    |
> | 2018080102  | app2    | plat1          | lang1  | 40000  | 10       | 100   
>    |
> +-------------+---------+----------------+--------+--------+----------+----------+{quote}
>  
> However if I do I group by on less than all the primary columns the 
> LAST_VALUE function ignores the rest of the group by columns in sorting and 
> returns incorrect last_value
>  
> {quote}select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3, 
> last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from 
> test_table group by pk1,t,pk3;{quote}
>  
> {quote}+-------------+---------+--------+----------+----------+
> |      T      | pk3  |   OI   | SUM(M2)  | SUM(M3)  |
> +-------------+---------+--------+----------+----------+
> | 2018080100  | app1    | 10     | 60       | 600      |
> | 2018080100  | app2    | 40     | 20       | 200      |
> | 2018080101  | app1    | 1000   | 60       | 600      |
> | 2018080101  | app2    | 4000   | 20       | 200      |
> | 2018080102  | app1    | 10000  | 40       | 300      |
> | 2018080102  | app2    | 40000  | 10       | 100      |
> +-------------+---------+--------+----------+----------+ {quote}
>  
> So instead of taking the last value of the group formed by  2018080100 and 
> app1 i.e 30 it is picking the first i.e 10.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (PHOENIX-4899) Incorrect behaviour of LAST_VALUE aggregation UDF

Reply via email to