[
https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884796#action_12884796
]
John Sichi commented on HIVE-1342:
----------------------------------
Hmmm....I looked into this one some more. Let me summarize what I found.
On trunk as it is today (without this patch), predicate pushdown does not (in
general) get optimized when we have a nested select with a join. For example:
{noformat}
explain
SELECT * FROM (
SELECT a.foo as foo1, b.foo as foo2, a.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo) z
WHERE bar=3;
...
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
z:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
z:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col2
type: int
expr: _col1
type: string
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: (_col2 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
However, it does kick in (sometimes correctly, sometimes incorrectly) in the
special case where aliases are reused. For example, it happens to work
correctly for a query like this:
{noformat}
explain
SELECT * FROM (
SELECT a.foo as foo1, b.foo as foo2, a.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}
But in cases like the original ones in the bug reports, it gets applied
incorrectly.
Ted's patch attempts to limit the damage by uniformly preventing the
optimization from applying for the pattern of nested select over join
(regardless of whether aliases have been reused).
If this is the best we can do for 0.6, then we'll have to live with that and
then open another issue for correcting the real problem so that we can get full
optimization (particularly for views).
I don't think it's a question of keeping the implementation simple; the patch
as is does not fix the optimization, it just disables it.
> Predicate push down get error result when sub-queries have the same alias
> name
> -------------------------------------------------------------------------------
>
> Key: HIVE-1342
> URL: https://issues.apache.org/jira/browse/HIVE-1342
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Ted Xu
> Assignee: Ted Xu
> Priority: Critical
> Fix For: 0.6.0
>
> Attachments: cmd.hql, explain, ppd_same_alias_1.patch,
> ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see
> the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> category_id string
> ,gmv_trade_num int
> ,user_id int
> )
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> select
> category_id1
> , category_id2
> , count(distinct user_id) as assoc_idx
> from (
> select
> t1.category_id as category_id1
> , t2.category_id as category_id2
> , t1.user_id
> from (
> select category_id, user_id
> from dm_fact_buyer_prd_info_d
> group by category_id, user_id ) t1
> join (
> select category_id, user_id
> from dm_fact_buyer_prd_info_d
> group by category_id, user_id ) t2 on
> t1.user_id=t2.user_id
> ) t1
> group by category_id1, category_id2 ) t_o
> where category_id1 <> category_id2
> and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast
> UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
> I explained the query and the execute plan looks really wired ( only Stage-1,
> see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t_o:t1:t1:dm_fact_buyer_prd_info_d
> TableScan
> alias: dm_fact_buyer_prd_info_d
> Filter Operator
> predicate:
> expr: *(category_id <> user_id)*
> type: boolean
> Select Operator
> expressions:
> expr: category_id
> type: string
> expr: user_id
> type: bigint
> outputColumnNames: category_id, user_id
> Group By Operator
> keys:
> expr: category_id
> type: string
> expr: user_id
> type: bigint
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
> key expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> sort order: ++
> Map-reduce partition columns:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> tag: -1
> Reduce Operator Tree:
> Group By Operator
> keys:
> expr: KEY._col0
> type: string
> expr: KEY._col1
> type: bigint
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Select Operator
> expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> outputColumnNames: _col0, _col1
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is
> gone; I tried disabling map side aggregate, the error is gone,too.
> *Changing the alias of subquery 't1' (either the inner one or the join
> result), the bug disappears, too.*
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.