[
https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885101#action_12885101
]
Ted Xu commented on HIVE-1342:
------------------------------
The patch does not simply disable PPD when it encounters the special case (nested
select over join). It prevents the replicated table resolve.
I tried the query above and it seems fine with the patch, that is, the
predicate can be pushed into the subquery. The explain result is shown below:
{code}
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
z:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: string
sort order: +
Map-reduce partition columns:
expr: foo
type: string
tag: 0
value expressions:
expr: foo
type: string
expr: bar
type: string
z:b
TableScan
alias: b
Filter Operator
predicate:
expr: (UDFToDouble(foo) = UDFToDouble(3))
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: string
sort order: +
Map-reduce partition columns:
expr: foo
type: string
tag: 1
value expressions:
expr: foo
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: string
expr: _col2
type: string
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (UDFToDouble(_col2) = UDFToDouble(3))
type: boolean
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{code}
I think the reason the trunk version cannot push the predicate into the subquery is
that it did a replicated table resolve and therefore could not find any table suitable
for that predicate, not that it disables PPD on purpose.
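To picture the alias collision described in the quoted issue, here is a toy model in Python. It is only a sketch and not Hive's actual code: the function names, the flat alias map, and the registration order are all assumptions made for illustration. It shows how resolving the positions behind "category_id1 <> category_id2" against the wrong 't1' would yield the columns seen in the reported plan.

```python
# Toy model only: NOT Hive's implementation. All names here are hypothetical.

def resolve_flat(subqueries, alias, positions):
    """Resolve column positions through a flat alias -> columns map.

    A later entry with the same alias overwrites the earlier one, which is
    the kind of collision the issue title describes.
    """
    by_alias = {}
    for name, columns in subqueries:
        by_alias[name] = columns  # a duplicate alias silently replaces the old entry
    return [by_alias[alias][p] for p in positions]


def resolve_scoped(subqueries, index, positions):
    """Resolve the same positions by pointing at one specific subquery."""
    _, columns = subqueries[index]
    return [columns[p] for p in positions]


# Assumed order: the outer 't1' (join result) first, the inner 't1' second.
subqueries = [
    ("t1", ["category_id1", "category_id2", "user_id"]),  # join result
    ("t1", ["category_id", "user_id"]),                   # inner group-by subquery
]

# The predicate "category_id1 <> category_id2" uses positions 0 and 1.
wrong = resolve_flat(subqueries, "t1", [0, 1])
right = resolve_scoped(subqueries, 0, [0, 1])
print(wrong)
print(right)
```

In this model the flat lookup returns the inner subquery's columns, while the scoped lookup keeps the predicate on the join result. Renaming either 't1' removes the collision in the model as well.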
> Predicate push down get error result when sub-queries have the same alias
> name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch,
> ppd_same_alias_2.patch
>
>
> The query is over-optimized by PPD when sub-queries have the same alias name; see
> the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> category_id string
> ,gmv_trade_num int
> ,user_id int
> )
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> select
> category_id1
> , category_id2
> , count(distinct user_id) as assoc_idx
> from (
> select
> t1.category_id as category_id1
> , t2.category_id as category_id2
> , t1.user_id
> from (
> select category_id, user_id
> from dm_fact_buyer_prd_info_d
> group by category_id, user_id ) t1
> join (
> select category_id, user_id
> from dm_fact_buyer_prd_info_d
> group by category_id, user_id ) t2 on
> t1.user_id=t2.user_id
> ) t1
> group by category_id1, category_id2 ) t_o
> where category_id1 <> category_id2
> and assoc_idx > 2;
> -----------------------------
> The query above fails at execution time, throwing the exception: "can not cast
> UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
> I ran explain on the query and the execution plan looks really weird (only Stage-1;
> see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t_o:t1:t1:dm_fact_buyer_prd_info_d
> TableScan
> alias: dm_fact_buyer_prd_info_d
> Filter Operator
> predicate:
> expr: *(category_id <> user_id)*
> type: boolean
> Select Operator
> expressions:
> expr: category_id
> type: string
> expr: user_id
> type: bigint
> outputColumnNames: category_id, user_id
> Group By Operator
> keys:
> expr: category_id
> type: string
> expr: user_id
> type: bigint
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
> key expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> sort order: ++
> Map-reduce partition columns:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> tag: -1
> Reduce Operator Tree:
> Group By Operator
> keys:
> expr: KEY._col0
> type: string
> expr: KEY._col1
> type: bigint
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Select Operator
> expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> outputColumnNames: _col0, _col1
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> ----------------------------------
> If predicate push down is disabled (set hive.optimize.ppd=false), the error is
> gone; when I tried disabling map-side aggregation, the error was gone, too.
> *Changing the alias of subquery 't1' (either the inner one or the join
> result) makes the bug disappear, too.*