[ 
https://issues.apache.org/jira/browse/HIVE-15872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaozhong Yang updated HIVE-15872:
----------------------------------
    Description: 
1. Original SQL:

select
    percentile_approx(
        column0,
        array(0.50, 0.70, 0.90, 0.95, 0.99)
    )
from
    my_table
where
    date = '20170207'
    and column1 = 'value1'
    and column2 = 'value2'
    and column3 = 'value3'
    and column4 = 'value4'
    and column5 = 'value5'

2. Exception stack trace:

Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:256)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:401)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
    ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:766)
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
    ... 7 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.merge(NumericHistogram.java:134)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.merge(GenericUDAFPercentileApprox.java:318)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:188)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:851)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:695)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)
    ... 8 more

3. Review the data:

select
    column0
from
    my_table
where
    date = '20170207'
    and column1 = 'value1'
    and column2 = 'value2'
    and column3 = 'value3'
    and column4 = 'value4'
    and column5 = 'value5'

After running this SQL, we found that the result is NULL (the filtered set is empty).

4. What does [0.0, 10000.0] in the stack trace mean?

In GenericUDAFPercentileApproxEvaluator, the `merge` method processes an 
ArrayList named partialHistogram. Normally, partialHistogram has the structure 
[npercentiles, percentile0, percentile1, ..., nbins, bin0.x, bin0.y, bin1.x, 
bin1.y, ...]. However, if the aggregated column values are NULL (an empty set), 
partialHistogram contains only [npercentiles(0), nbins(10000)]. That is why the 
stack trace shows the strange row data: 
{"key":{},"value":{"_col0":[0.0,10000.0]}}

Before histogram#merge is called (the on-line histogram algorithm from the 
paper http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf ), the 
elements that store the percentiles should be removed from partialHistogram 
via `partialHistogram.subList(0, nquantiles+1).clear();`. In the empty-set 
case, however, GenericUDAFPercentileApproxEvaluator does not remove them. 
Consequently, NumericHistogram tries to merge a list that contains only two 
elements ([0.0, 10000.0]) and throws IndexOutOfBoundsException.
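
For illustration, here is a minimal sketch of one way to guard the merge, 
assuming the layout above (plain Java; `Histogram` is a hypothetical stand-in 
for NumericHistogram, and the actual patch attached to this issue may differ 
in detail):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for NumericHistogram#merge; it expects the serialized
// form [nbins, bin0.x, bin0.y, ...] with the percentile prefix already removed.
interface Histogram {
    void merge(List<Double> serialized);
}

public class MergeGuardSketch {
    // Strip the [npercentiles, p0, ..., p(n-1)] prefix unconditionally, so the
    // empty-set partial [0.0, 10000.0] becomes [10000.0] (a histogram header
    // with no bins) instead of reaching merge() with the prefix still attached.
    static void mergePartial(List<Double> partialHistogram, Histogram histogram) {
        int nquantiles = partialHistogram.get(0).intValue();
        partialHistogram.subList(0, nquantiles + 1).clear();
        if (partialHistogram.size() > 1) { // anything beyond the nbins header?
            histogram.merge(partialHistogram);
        }
    }

    public static void main(String[] args) {
        Histogram printer = serialized -> System.out.println("merging " + serialized);
        // Normal partial: merges [2.0, 1.0, 3.0, 5.0, 1.0].
        mergePartial(new ArrayList<>(Arrays.asList(2.0, 0.5, 0.9, 2.0, 1.0, 3.0, 5.0, 1.0)), printer);
        // Empty-set partial: reduced to [10000.0], so the merge is skipped.
        mergePartial(new ArrayList<>(Arrays.asList(0.0, 10000.0)), printer);
    }
}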


> The PERCENTILE UDAF does not work with empty set
> ------------------------------------------------
>
>                 Key: HIVE-15872
>                 URL: https://issues.apache.org/jira/browse/HIVE-15872
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Chaozhong Yang
>            Assignee: Chaozhong Yang
>             Fix For: 2.1.2
>
>         Attachments: HIVE-15872.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
