Zheng,

Thank you for your reply.
Well, it seems hard for me to reproduce this bug with a simpler query. However, *if I change the alias of subquery 't1' (either the inner one or the join result), the bug disappears*. I'm wondering whether it is possible for table aliases at different nesting levels to conflict when they share the same name.

2010/5/12 Zheng Shao <[email protected]>

> Yes, that does seem to be a bug.
>
> Can you try to simplify the query while reproducing the bug?
> That will make it a lot easier to debug and fix.
>
>
> Zheng
>
> On Tue, May 11, 2010 at 7:44 PM, Ted Xu <[email protected]> wrote:
> > Hi all,
> > I think I found a bug, but I'm not sure whether the problem is in the
> > optimizer (PPD) or in the map-side aggregation.
> > See the query listed below:
> > -------------------------------------
> > create table if not exists dm_fact_buyer_prd_info_d (
> > category_id string
> > ,gmv_trade_num int
> > ,user_id int
> > )
> > PARTITIONED BY (ds int);
> >
> > set hive.optimize.ppd=true;
> > set hive.map.aggr=true;
> >
> > explain select 20100426, category_id1, category_id2, assoc_idx
> > from (
> >   select
> >     category_id1
> >     , category_id2
> >     , count(distinct user_id) as assoc_idx
> >   from (
> >     select
> >       t1.category_id as category_id1
> >       , t2.category_id as category_id2
> >       , t1.user_id
> >     from (
> >       select category_id, user_id
> >       from dm_fact_buyer_prd_info_d
> >       where ds <= 20100426
> >         and ds > 20100419
> >         and category_id > 0
> >         and gmv_trade_num > 0
> >       group by category_id, user_id ) t1
> >     join (
> >       select category_id, user_id
> >       from dm_fact_buyer_prd_info_d
> >       where ds <= 20100426
> >         and ds > 20100419
> >         and category_id > 0
> >         and gmv_trade_num > 0
> >       group by category_id, user_id ) t2 on t1.user_id = t2.user_id
> >   ) t1
> >   group by category_id1, category_id2 ) t_o
> > where category_id1 <> category_id2
> >   and assoc_idx > 2;
> > --------------------------------
> > The query above fails when executed, throwing the exception: "can not cast
> > UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
> > I explained the query and the execution plan looks really weird (see the
> > highlighted predicate, i.e. the extra (category_id <> user_id) term in the
> > first Filter Operator of Stage-1):
> > --------------------------------
> > ABSTRACT SYNTAX TREE:
> >   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY
> > (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_SUBQUERY (TOK_QUERY (TOK_FROM
> > (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION
> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL
> > category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and
> > (and (and (<= (TOK_TABLE_OR_COL ds) 20100426) (> (TOK_TABLE_OR_COL ds)
> > 20100419)) (> (TOK_TABLE_OR_COL category_id) 0)) (> (TOK_TABLE_OR_COL
> > gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id)
> > (TOK_TABLE_OR_COL user_id)))) t1) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM
> > (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION
> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL
> > category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and
> > (and (and (<= (TOK_TABLE_OR_COL ds) 20100426) (> (TOK_TABLE_OR_COL ds)
> > 20100419)) (> (TOK_TABLE_OR_COL category_id) 0)) (> (TOK_TABLE_OR_COL
> > gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id)
> > (TOK_TABLE_OR_COL user_id)))) t2) (= (. (TOK_TABLE_OR_COL t1) user_id)
> > (. (TOK_TABLE_OR_COL t2) user_id)))) (TOK_INSERT (TOK_DESTINATION
> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1)
> > category_id) category_id1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2)
> > category_id) category_id2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1)
> > user_id))))) t1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> > (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR
> > (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_FUNCTIONDI count
> > (TOK_TABLE_OR_COL user_id)) assoc_idx)) (TOK_GROUPBY (TOK_TABLE_OR_COL
> > category_id1) (TOK_TABLE_OR_COL category_id2)))) t_o)) (TOK_INSERT
> > (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
> > 20100426) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR
> > (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_TABLE_OR_COL
> > assoc_idx))) (TOK_WHERE (and (<> (TOK_TABLE_OR_COL category_id1)
> > (TOK_TABLE_OR_COL category_id2)) (> (TOK_TABLE_OR_COL assoc_idx) 2)))))
> >
> > STAGE DEPENDENCIES:
> >   Stage-1 is a root stage
> >   Stage-2 depends on stages: Stage-1, Stage-4
> >   Stage-3 depends on stages: Stage-2
> >   Stage-4 is a root stage
> >   Stage-0 is a root stage
> >
> > STAGE PLANS:
> >   Stage: Stage-1
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         t_o:t1:t1:dm_fact_buyer_prd_info_d
> >           TableScan
> >             alias: dm_fact_buyer_prd_info_d
> >             Filter Operator
> >               predicate:
> >                   expr: (((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0)) and (category_id <> user_id))
> >                   type: boolean
> >               Filter Operator
> >                 predicate:
> >                     expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0))
> >                     type: boolean
> >                 Select Operator
> >                   expressions:
> >                         expr: category_id
> >                         type: string
> >                         expr: user_id
> >                         type: int
> >                   outputColumnNames: category_id, user_id
> >                   Group By Operator
> >                     keys:
> >                           expr: category_id
> >                           type: string
> >                           expr: user_id
> >                           type: int
> >                     mode: hash
> >                     outputColumnNames: _col0, _col1
> >                     Reduce Output Operator
> >                       key expressions:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       sort order: ++
> >                       Map-reduce partition columns:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       tag: -1
> >       Reduce Operator Tree:
> >         Group By Operator
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >                 expr: KEY._col1
> >                 type: int
> >           mode: mergepartial
> >           outputColumnNames: _col0, _col1
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: int
> >             outputColumnNames: _col0, _col1
> >             File Output Operator
> >               compressed: true
> >               GlobalTableId: 0
> >               table:
> >                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> >                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> >
> >   Stage: Stage-2
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         $INTNAME
> >             Reduce Output Operator
> >               key expressions:
> >                     expr: _col1
> >                     type: int
> >               sort order: +
> >               Map-reduce partition columns:
> >                     expr: _col1
> >                     type: int
> >               tag: 0
> >               value expressions:
> >                     expr: _col0
> >                     type: string
> >                     expr: _col1
> >                     type: int
> >         $INTNAME1
> >             Reduce Output Operator
> >               key expressions:
> >                     expr: _col1
> >                     type: int
> >               sort order: +
> >               Map-reduce partition columns:
> >                     expr: _col1
> >                     type: int
> >               tag: 1
> >               value expressions:
> >                     expr: _col0
> >                     type: string
> >       Reduce Operator Tree:
> >         Join Operator
> >           condition map:
> >                Inner Join 0 to 1
> >           condition expressions:
> >             0 {VALUE._col0} {VALUE._col1}
> >             1 {VALUE._col0}
> >           outputColumnNames: _col0, _col1, _col2
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col2
> >                   type: string
> >                   expr: _col1
> >                   type: int
> >             outputColumnNames: _col0, _col1, _col2
> >             Select Operator
> >               expressions:
> >                     expr: _col0
> >                     type: string
> >                     expr: _col1
> >                     type: string
> >                     expr: _col2
> >                     type: int
> >               outputColumnNames: _col0, _col1, _col2
> >               Group By Operator
> >                 aggregations:
> >                       expr: count(DISTINCT _col2)
> >                 keys:
> >                       expr: _col0
> >                       type: string
> >                       expr: _col1
> >                       type: string
> >                       expr: _col2
> >                       type: int
> >                 mode: hash
> >                 outputColumnNames: _col0, _col1, _col2, _col3
> >                 File Output Operator
> >                   compressed: true
> >                   GlobalTableId: 0
> >                   table:
> >                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> >                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> >
> >   Stage: Stage-3
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         file:/group/tbdev/shaojie/scratch/420686432\10003
> >           Reduce Output Operator
> >             key expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: string
> >                   expr: _col2
> >                   type: int
> >             sort order: +++
> >             Map-reduce partition columns:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: string
> >             tag: -1
> >             value expressions:
> >                   expr: _col3
> >                   type: bigint
> >       Reduce Operator Tree:
> >         Group By Operator
> >           aggregations:
> >                 expr: count(DISTINCT KEY._col2)
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >                 expr: KEY._col1
> >                 type: string
> >           mode: mergepartial
> >           outputColumnNames: _col0, _col1, _col2
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: string
> >                   expr: _col2
> >                   type: bigint
> >             outputColumnNames: _col0, _col1, _col2
> >             Filter Operator
> >               predicate:
> >                   expr: ((_col0 <> _col1) and (UDFToDouble(_col2) > UDFToDouble(2)))
> >                   type: boolean
> >               Select Operator
> >                 expressions:
> >                       expr: 20100426
> >                       type: int
> >                       expr: _col0
> >                       type: string
> >                       expr: _col1
> >                       type: string
> >                       expr: _col2
> >                       type: bigint
> >                 outputColumnNames: _col0, _col1, _col2, _col3
> >                 File Output Operator
> >                   compressed: false
> >                   GlobalTableId: 0
> >                   table:
> >                       input format: org.apache.hadoop.mapred.TextInputFormat
> >                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> >
> >   Stage: Stage-4
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         t_o:t1:t2:dm_fact_buyer_prd_info_d
> >           TableScan
> >             alias: dm_fact_buyer_prd_info_d
> >             Filter Operator
> >               predicate:
> >                   expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0))
> >                   type: boolean
> >               Filter Operator
> >                 predicate:
> >                     expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0))
> >                     type: boolean
> >                 Select Operator
> >                   expressions:
> >                         expr: category_id
> >                         type: string
> >                         expr: user_id
> >                         type: int
> >                   outputColumnNames: category_id, user_id
> >                   Group By Operator
> >                     keys:
> >                           expr: category_id
> >                           type: string
> >                           expr: user_id
> >                           type: int
> >                     mode: hash
> >                     outputColumnNames: _col0, _col1
> >                     Reduce Output Operator
> >                       key expressions:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       sort order: ++
> >                       Map-reduce partition columns:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       tag: -1
> >       Reduce Operator Tree:
> >         Group By Operator
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >                 expr: KEY._col1
> >                 type: int
> >           mode: mergepartial
> >           outputColumnNames: _col0, _col1
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: int
> >             outputColumnNames: _col0, _col1
> >             File Output Operator
> >               compressed: true
> >               GlobalTableId: 0
> >               table:
> >                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> >                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> >
> >   Stage: Stage-0
> >     Fetch Operator
> >       limit: -1
> > --------------------------------
> >
> > Well, I tried disabling predicate pushdown (set hive.optimize.ppd=false)
> > and the error was gone; I tried disabling map-side aggregation, and the
> > error was gone, too.
> > Does anybody know what the problem is? Please give me some advice.
> >
> > --
> > Best Regards,
> > Ted Xu
>
> --
> Yours,
> Zheng
> http://www.linkedin.com/in/zshao

--
Best Regards,
Ted Xu
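
For reference, the two workarounds mentioned in this thread can be sketched as follows. This is only an illustration, not a fix for the underlying bug: the alias 't_join' is an arbitrary replacement name, and the elided subqueries ( ... ) are the same as in the original query.

--------------------------------
-- Workaround 1: rename the join-result alias so it no longer
-- collides with the inner subquery alias 't1'.
explain select 20100426, category_id1, category_id2, assoc_idx
from (
  select category_id1, category_id2, count(distinct user_id) as assoc_idx
  from (
    select t1.category_id as category_id1
         , t2.category_id as category_id2
         , t1.user_id
    from ( ... ) t1           -- inner subquery unchanged
    join ( ... ) t2 on t1.user_id = t2.user_id
  ) t_join                    -- was 't1'; renaming makes the bug disappear
  group by category_id1, category_id2 ) t_o
where category_id1 <> category_id2
  and assoc_idx > 2;

-- Workaround 2: disable either optimization before running the query.
set hive.optimize.ppd=false;  -- predicate pushdown off
set hive.map.aggr=false;      -- map-side aggregation off
--------------------------------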
