[
https://issues.apache.org/jira/browse/HIVE-23893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162744#comment-17162744
]
Zhihua Deng commented on HIVE-23893:
------------------------------------
[~pgaref] the problem can be reproduced on master, like query:
set hive.cbo.enable=false;
create table a(k string) partitioned by(hs int);
create table b(k string) partitioned by(hs int);
explain extended select a.*, b.* from a join b on a.k = b.k where rand(100) <
0.1 and a.hs = 11 and b.hs = 10;
Query Plan:
{noformat}
STAGE PLANS:
Stage: Stage-1
Tez
#### A masked pattern was here ####
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
#### A masked pattern was here ####
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
filterExpr: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE
Column stats: PARTIAL
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats:
COMPLETE Column stats: PARTIAL
Reduce Output Operator
bucketingVersion: 2
key expressions: k (type: string)
null sort order: z
numBuckets: -1
sort order: +
Map-reduce partition columns: k (type: string)
Statistics: Num rows: 1 Data size: 188 Basic stats:
COMPLETE Column stats: PARTIAL
tag: 0
value expressions: hs (type: int)
auto parallelism: true
Execution mode: vectorized, llap
LLAP IO: unknown
Map 3
Map Operator Tree:
TableScan
alias: b
filterExpr: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE
Column stats: PARTIAL
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats:
COMPLETE Column stats: PARTIAL
Reduce Output Operator
bucketingVersion: 2
key expressions: k (type: string)
null sort order: z
numBuckets: -1
sort order: +
Map-reduce partition columns: k (type: string)
Statistics: Num rows: 1 Data size: 188 Basic stats:
COMPLETE Column stats: PARTIAL
tag: 1
value expressions: hs (type: int)
auto parallelism: true
Execution mode: vectorized, llap
LLAP IO: unknown
Reducer 2
Execution mode: llap
Needs Tagging: false
Reduce Operator Tree:
Merge Join Operator
condition map:
Inner Join 0 to 1
keys:
0 k (type: string)
1 k (type: string)
outputColumnNames: _col0, _col1, _col5, _col6
Position of Big Table: 0
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE
Column stats: PARTIAL
Filter Operator
isSamplingPred: false
predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 =
10)) (type: boolean)
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE
Column stats: PARTIAL
Select Operator
expressions: _col0 (type: string), 11 (type: int), _col5
(type: string), 10 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 376 Basic stats:
COMPLETE Column stats: PARTIAL
File Output Operator
bucketingVersion: 2
compressed: false
GlobalTableId: 0
#### A masked pattern was here ####
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 376 Basic stats:
COMPLETE Column stats: PARTIAL
#### A masked pattern was here ####
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
bucketing_version -1
columns _col0,_col1,_col2,_col3
columns.types string:int:string:int
escape.delim \
hive.serialization.extend.additional.nesting.levels
true
serialization.escape.crlf true
serialization.format 1
serialization.lib
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false{noformat}
> Extract deterministic conditions for pdd when the predicate contains
> non-deterministic function
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-23893
> URL: https://issues.apache.org/jira/browse/HIVE-23893
> Project: Hive
> Issue Type: Improvement
> Components: Logical Optimizer
> Reporter: Zhihua Deng
> Priority: Major
>
> Taken the following query for example, assume unix_timestamp is
> non-deterministic before version 1.3.0:
>
> {{SELECT}}
> {{ from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') AS ft,}}
> {{ b.game_id AS game_id,}}
> {{ b.game_name AS game_name,}}
> {{ count(DISTINCT a.sha1_imei) uv}}
> {{FROM}}
> {{ gamesdk_userprofile a}}
> {{ JOIN game_info_all b ON a.appid = b.dev_app_id}}
> {{WHERE}}
> {{ a.date = 20200704}}
> {{ AND from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') =
> 20200704}}
> {{ AND b.date = 20200704}}
> {{GROUP BY}}
> {{ from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd'),}}
> {{ b.game_id,}}
> {{ b.game_name}}
> {{ORDER BY}}
> {{ uv DESC}}
> {{LIMIT 200;}}
>
> The predicates(a.date = 20200704, b.date = 20200704) are unable to push down
> to join op, make the optimizer unable to prune partitions, which may result
> to a full scan on tables gamesdk_userprofile and game_info_all.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)