[ 
https://issues.apache.org/jira/browse/HIVE-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Rebele updated HIVE-29364:
---------------------------------
    Description: 
Consider the following queries executed within the context of 
[{{stats_histogram.q}}|https://github.com/apache/hive/blob/fe70ee9efcdf06a5580d51bb2be9698289e791ce/ql/src/test/queries/clientpositive/stats_histogram.q].
 The histogram statistics are used for the predicates without a CAST:
{code:java}
EXPLAIN SELECT COUNT(*) FROM test_stats WHERE b < 6;
SELECT COUNT(*) FROM test_stats WHERE b < 6;

EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(b as BIGINT) < 6;
SELECT COUNT(*) FROM test_stats WHERE cast(b as BIGINT) < 6;{code}
The rowcount estimate for the filter {{b<6}} is 11, for the filter {{cast(b as 
BIGINT)}} is 5; the {{COUNT( * )}} result is 11 for both.

(I didn't check {{{}b BETWEEN ...{}}}, though I think the same issue occurs 
there as well)

 
{code:java}
EXPLAIN SELECT COUNT(*) FROM test_stats WHERE g >= "2020-11-7";
SELECT COUNT(*) FROM test_stats WHERE g >= "2020-11-7";

EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) >= 
"2020-11-7 00:00:00";
SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) >= "2020-11-7 
00:00:00";
{code}
The rowcount estimate for the filter {{g >= "2020-11-7"}} is 1, for the filter 
{{cast(g as timestamp) >= "2020-11-7 00:00:00"}} is 5; the {{COUNT( * )}} 
result is 1 for both.

 
{code:java}
EXPLAIN SELECT COUNT(*) FROM test_stats WHERE g BETWEEN "2020-11-01" AND 
"2020-11-06";
SELECT COUNT(*) FROM test_stats WHERE g BETWEEN "2020-11-01" AND "2020-11-06";

EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) BETWEEN 
"2020-11-01 00:00:00" AND "2020-11-06 00:00:00";
SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) BETWEEN "2020-11-01 
00:00:00" AND "2020-11-06 00:00:00"; {code}
The rowcount estimate for the filter {{g BETWEEN "2020-11-01" AND 
"2020-11-06"}} is 11, for the filter {{cast(g as timestamp) BETWEEN "2020-11-01 
00:00:00" AND "2020-11-06 00:00:00"}} is 1; the {{COUNT( * )}} result is 12 for 
both.

 
----
Some pathological cases to be avoided: if the CAST throws away information, the 
histogram statistics should not be used. E.g., {{{}CAST(128 as TINYINT) = 
-128{}}}, so a {{CAST(intField as TINYINT) > 42}} should not be treated the 
same as {{{}intField > 42{}}}, if the intField is outside of the range of 
TINYINT.

  was:
Consider the following queries executed within the context of 
[{{stats_histogram.q}}|https://github.com/apache/hive/blob/fe70ee9efcdf06a5580d51bb2be9698289e791ce/ql/src/test/queries/clientpositive/stats_histogram.q].
 The histogram statistics are used for the predicates without a CAST:
{code:java}
EXPLAIN SELECT COUNT(*) FROM test_stats WHERE b < 6;
SELECT COUNT(*) FROM test_stats WHERE b < 6;

EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(b as BIGINT) < 6;
SELECT COUNT(*) FROM test_stats WHERE cast(b as BIGINT) < 6;{code}
The rowcount estimate for the filter {{b<6}} is 11, for the filter {{cast(b as 
BIGINT)}} is 5; the {{COUNT(*)}} result is 11 for both.

(I didn't check {{{}b BETWEEN ...{}}}, though I think the same issue occurs 
there as well)

 
{code:java}
EXPLAIN SELECT COUNT(*) FROM test_stats WHERE g >= "2020-11-7";
SELECT COUNT(*) FROM test_stats WHERE g >= "2020-11-7";

EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) >= 
"2020-11-7 00:00:00";
SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) >= "2020-11-7 
00:00:00";
{code}
The rowcount estimate for the filter {{g >= "2020-11-7"}} is 1, for the filter 
{{cast(g as timestamp) >= "2020-11-7 00:00:00"}} is 5; the {{COUNT(*)}} result 
is 1 for both.

 
{code:java}
EXPLAIN SELECT COUNT(*) FROM test_stats WHERE g BETWEEN "2020-11-01" AND 
"2020-11-06";
SELECT COUNT(*) FROM test_stats WHERE g BETWEEN "2020-11-01" AND "2020-11-06";

EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) BETWEEN 
"2020-11-01 00:00:00" AND "2020-11-06 00:00:00";
SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) BETWEEN "2020-11-01 
00:00:00" AND "2020-11-06 00:00:00"; {code}
The rowcount estimate for the filter {{g BETWEEN "2020-11-01" AND 
"2020-11-06"}} is 11, for the filter {{cast(g as timestamp) BETWEEN "2020-11-01 
00:00:00" AND "2020-11-06 00:00:00"}} is 1; the {{COUNT(*)}} result is 12 for 
both.

 
----
Some pathological cases to be avoided: if the CAST throws away information, the 
histogram statistics should not be used. E.g., {{{}CAST(128 as TINYINT) = 
-128{}}}, so a {{CAST(intField as TINYINT) > 42}} should not be treated the 
same as {{{}intField > 42{}}}, if the intField is outside of the range of 
TINYINT.


> Use histogram statistics for range predicates with a CAST
> ---------------------------------------------------------
>
>                 Key: HIVE-29364
>                 URL: https://issues.apache.org/jira/browse/HIVE-29364
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Thomas Rebele
>            Priority: Major
>
> Consider the following queries executed within the context of 
> [{{stats_histogram.q}}|https://github.com/apache/hive/blob/fe70ee9efcdf06a5580d51bb2be9698289e791ce/ql/src/test/queries/clientpositive/stats_histogram.q].
>  The histogram statistics are used for the predicates without a CAST:
> {code:java}
> EXPLAIN SELECT COUNT(*) FROM test_stats WHERE b < 6;
> SELECT COUNT(*) FROM test_stats WHERE b < 6;
> EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(b as BIGINT) < 6;
> SELECT COUNT(*) FROM test_stats WHERE cast(b as BIGINT) < 6;{code}
> The rowcount estimate for the filter {{b<6}} is 11, for the filter {{cast(b 
> as BIGINT)}} is 5; the {{COUNT( * )}} result is 11 for both.
> (I didn't check {{{}b BETWEEN ...{}}}, though I think the same issue occurs 
> there as well)
>  
> {code:java}
> EXPLAIN SELECT COUNT(*) FROM test_stats WHERE g >= "2020-11-7";
> SELECT COUNT(*) FROM test_stats WHERE g >= "2020-11-7";
> EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) >= 
> "2020-11-7 00:00:00";
> SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) >= "2020-11-7 
> 00:00:00";
> {code}
> The rowcount estimate for the filter {{g >= "2020-11-7"}} is 1, for the 
> filter {{cast(g as timestamp) >= "2020-11-7 00:00:00"}} is 5; the {{COUNT( * 
> )}} result is 1 for both.
>  
> {code:java}
> EXPLAIN SELECT COUNT(*) FROM test_stats WHERE g BETWEEN "2020-11-01" AND 
> "2020-11-06";
> SELECT COUNT(*) FROM test_stats WHERE g BETWEEN "2020-11-01" AND "2020-11-06";
> EXPLAIN SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) BETWEEN 
> "2020-11-01 00:00:00" AND "2020-11-06 00:00:00";
> SELECT COUNT(*) FROM test_stats WHERE cast(g as timestamp) BETWEEN 
> "2020-11-01 00:00:00" AND "2020-11-06 00:00:00"; {code}
> The rowcount estimate for the filter {{g BETWEEN "2020-11-01" AND 
> "2020-11-06"}} is 11, for the filter {{cast(g as timestamp) BETWEEN 
> "2020-11-01 00:00:00" AND "2020-11-06 00:00:00"}} is 1; the {{COUNT( * )}} 
> result is 12 for both.
>  
> ----
> Some pathological cases to be avoided: if the CAST throws away information, 
> the histogram statistics should not be used. E.g., {{{}CAST(128 as TINYINT) = 
> -128{}}}, so a {{CAST(intField as TINYINT) > 42}} should not be treated the 
> same as {{{}intField > 42{}}}, if the intField is outside of the range of 
> TINYINT.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to