[ 
https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1342:
-----------------------------

      Status: Open  (was: Patch Available)
    Assignee: Ted Xu

Hi Ted,

I don't think this patch is general enough.  To really fix the problem, it will 
be necessary to dig into it deeper and find out where we are currently using 
unscoped aliases (e.g. t2) where we should instead be using scoped aliases 
(e.g. t_o:t1:t2:dm_fact_buyer_prd_info_d).  I suspect that if you get to the 
root of the problem, the fix will take care of HIVE-1395 too.

Also, when submitting a patch, please run diff from the hive trunk directory 
(not from a subdirectory such as ql).


> Predicate push down get error result when sub-queries have the same alias 
> name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see 
> the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
>               category_id string
>               ,gmv_trade_num  int
>               ,user_id    int
>               )
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
>               select 
>                       category_id1
>                       , category_id2
>                       , count(distinct user_id) as assoc_idx
>               from (
>                       select 
>                               t1.category_id as category_id1
>                               , t2.category_id as category_id2
>                               , t1.user_id
>                       from (
>                               select category_id, user_id
>                               from dm_fact_buyer_prd_info_d
>                               group by category_id, user_id ) t1
>                       join (
>                               select category_id, user_id
>                               from dm_fact_buyer_prd_info_d
>                               group by category_id, user_id ) t2 on 
> t1.user_id=t2.user_id 
>                       ) t1
>                       group by category_id1, category_id2 ) t_o
>                       where category_id1 <> category_id2
>                       and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast 
> UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, 
> see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is 
> gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join 
> result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to