Zheng,

Thank you for your reply.
Well, it seems hard for me to reproduce this bug with a simpler query. However, *if I change the alias of subquery 't1' (either the inner one or the join result), the bug disappears*. I'm wondering whether it is possible for table aliases at different nesting levels to conflict when they share the same name.

2010/5/12 Zheng Shao <[email protected]>

> Yes, that does seem to be a bug.
>
> Can you try to simplify the query while reproducing the bug?
> That will make it a lot easier to debug and fix.
>
>
> Zheng
>
> On Tue, May 11, 2010 at 7:44 PM, Ted Xu <[email protected]> wrote:
> > Hi all,
> > I think I found a bug, but I'm not sure whether the problem is in the
> > optimizer (PPD) or in the map-side aggregation.
> > See the query listed below:
> > -------------------------------------
> > create table if not exists dm_fact_buyer_prd_info_d (
> > category_id string
> > ,gmv_trade_num int
> > ,user_id int
> > )
> > PARTITIONED BY (ds int);
> >
> > set hive.optimize.ppd=true;
> > set hive.map.aggr=true;
> >
> > explain select 20100426, category_id1, category_id2, assoc_idx
> > from (
> >   select
> >     category_id1
> >     , category_id2
> >     , count(distinct user_id) as assoc_idx
> >   from (
> >     select
> >       t1.category_id as category_id1
> >       , t2.category_id as category_id2
> >       , t1.user_id
> >     from (
> >       select category_id, user_id
> >       from dm_fact_buyer_prd_info_d
> >       where ds <= 20100426
> >         and ds > 20100419
> >         and category_id > 0
> >         and gmv_trade_num > 0
> >       group by category_id, user_id ) t1
> >     join (
> >       select category_id, user_id
> >       from dm_fact_buyer_prd_info_d
> >       where ds <= 20100426
> >         and ds > 20100419
> >         and category_id > 0
> >         and gmv_trade_num > 0
> >       group by category_id, user_id ) t2 on t1.user_id = t2.user_id
> >   ) t1
> >   group by category_id1, category_id2 ) t_o
> > where category_id1 <> category_id2
> >   and assoc_idx > 2;
> > --------------------------------
> > The query above fails when executed, throwing the exception: "can not cast
> > UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
> > I explained the query and the execution plan looks really weird (see the
> > highlighted predicate, i.e. the extra (category_id <> user_id) term in the
> > first Filter Operator of Stage-1):
> > --------------------------------
> > ABSTRACT SYNTAX TREE:
> >   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY
> > (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_SUBQUERY (TOK_QUERY (TOK_FROM
> > (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION
> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL
> > category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and
> > (and (and (<= (TOK_TABLE_OR_COL ds) 20100426) (> (TOK_TABLE_OR_COL ds)
> > 20100419)) (> (TOK_TABLE_OR_COL category_id) 0)) (> (TOK_TABLE_OR_COL
> > gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id)
> > (TOK_TABLE_OR_COL user_id)))) t1) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM
> > (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION
> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL
> > category_id)) (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and
> > (and (and (<= (TOK_TABLE_OR_COL ds) 20100426) (> (TOK_TABLE_OR_COL ds)
> > 20100419)) (> (TOK_TABLE_OR_COL category_id) 0)) (> (TOK_TABLE_OR_COL
> > gmv_trade_num) 0))) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id)
> > (TOK_TABLE_OR_COL user_id)))) t2) (= (. (TOK_TABLE_OR_COL t1) user_id)
> > (. (TOK_TABLE_OR_COL t2) user_id)))) (TOK_INSERT (TOK_DESTINATION
> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1)
> > category_id) category_id1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2)
> > category_id) category_id2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1)
> > user_id))))) t1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> > (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR
> > (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_FUNCTIONDI count
> > (TOK_TABLE_OR_COL user_id)) assoc_idx)) (TOK_GROUPBY (TOK_TABLE_OR_COL
> > category_id1) (TOK_TABLE_OR_COL category_id2)))) t_o)) (TOK_INSERT
> > (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
> > 20100426) (TOK_SELEXPR (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR
> > (TOK_TABLE_OR_COL category_id2)) (TOK_SELEXPR (TOK_TABLE_OR_COL
> > assoc_idx))) (TOK_WHERE (and (<> (TOK_TABLE_OR_COL category_id1)
> > (TOK_TABLE_OR_COL category_id2)) (> (TOK_TABLE_OR_COL assoc_idx) 2)))))
> >
> > STAGE DEPENDENCIES:
> >   Stage-1 is a root stage
> >   Stage-2 depends on stages: Stage-1, Stage-4
> >   Stage-3 depends on stages: Stage-2
> >   Stage-4 is a root stage
> >   Stage-0 is a root stage
> >
> > STAGE PLANS:
> >   Stage: Stage-1
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         t_o:t1:t1:dm_fact_buyer_prd_info_d
> >           TableScan
> >             alias: dm_fact_buyer_prd_info_d
> >             Filter Operator
> >               predicate:
> >                   expr: (((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0)) and (category_id <> user_id))
> >                   type: boolean
> >               Filter Operator
> >                 predicate:
> >                     expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0))
> >                     type: boolean
> >                 Select Operator
> >                   expressions:
> >                         expr: category_id
> >                         type: string
> >                         expr: user_id
> >                         type: int
> >                   outputColumnNames: category_id, user_id
> >                   Group By Operator
> >                     keys:
> >                           expr: category_id
> >                           type: string
> >                           expr: user_id
> >                           type: int
> >                     mode: hash
> >                     outputColumnNames: _col0, _col1
> >                     Reduce Output Operator
> >                       key expressions:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       sort order: ++
> >                       Map-reduce partition columns:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       tag: -1
> >       Reduce Operator Tree:
> >         Group By Operator
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >                 expr: KEY._col1
> >                 type: int
> >           mode: mergepartial
> >           outputColumnNames: _col0, _col1
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: int
> >             outputColumnNames: _col0, _col1
> >             File Output Operator
> >               compressed: true
> >               GlobalTableId: 0
> >               table:
> >                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> >                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> >
> >   Stage: Stage-2
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         $INTNAME
> >             Reduce Output Operator
> >               key expressions:
> >                     expr: _col1
> >                     type: int
> >               sort order: +
> >               Map-reduce partition columns:
> >                     expr: _col1
> >                     type: int
> >               tag: 0
> >               value expressions:
> >                     expr: _col0
> >                     type: string
> >                     expr: _col1
> >                     type: int
> >         $INTNAME1
> >             Reduce Output Operator
> >               key expressions:
> >                     expr: _col1
> >                     type: int
> >               sort order: +
> >               Map-reduce partition columns:
> >                     expr: _col1
> >                     type: int
> >               tag: 1
> >               value expressions:
> >                     expr: _col0
> >                     type: string
> >       Reduce Operator Tree:
> >         Join Operator
> >           condition map:
> >                Inner Join 0 to 1
> >           condition expressions:
> >             0 {VALUE._col0} {VALUE._col1}
> >             1 {VALUE._col0}
> >           outputColumnNames: _col0, _col1, _col2
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col2
> >                   type: string
> >                   expr: _col1
> >                   type: int
> >             outputColumnNames: _col0, _col1, _col2
> >             Select Operator
> >               expressions:
> >                     expr: _col0
> >                     type: string
> >                     expr: _col1
> >                     type: string
> >                     expr: _col2
> >                     type: int
> >               outputColumnNames: _col0, _col1, _col2
> >               Group By Operator
> >                 aggregations:
> >                       expr: count(DISTINCT _col2)
> >                 keys:
> >                       expr: _col0
> >                       type: string
> >                       expr: _col1
> >                       type: string
> >                       expr: _col2
> >                       type: int
> >                 mode: hash
> >                 outputColumnNames: _col0, _col1, _col2, _col3
> >                 File Output Operator
> >                   compressed: true
> >                   GlobalTableId: 0
> >                   table:
> >                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> >                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> >
> >   Stage: Stage-3
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         file:/group/tbdev/shaojie/scratch/420686432\10003
> >           Reduce Output Operator
> >             key expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: string
> >                   expr: _col2
> >                   type: int
> >             sort order: +++
> >             Map-reduce partition columns:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: string
> >             tag: -1
> >             value expressions:
> >                   expr: _col3
> >                   type: bigint
> >       Reduce Operator Tree:
> >         Group By Operator
> >           aggregations:
> >                 expr: count(DISTINCT KEY._col2)
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >                 expr: KEY._col1
> >                 type: string
> >           mode: mergepartial
> >           outputColumnNames: _col0, _col1, _col2
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: string
> >                   expr: _col2
> >                   type: bigint
> >             outputColumnNames: _col0, _col1, _col2
> >             Filter Operator
> >               predicate:
> >                   expr: ((_col0 <> _col1) and (UDFToDouble(_col2) > UDFToDouble(2)))
> >                   type: boolean
> >               Select Operator
> >                 expressions:
> >                       expr: 20100426
> >                       type: int
> >                       expr: _col0
> >                       type: string
> >                       expr: _col1
> >                       type: string
> >                       expr: _col2
> >                       type: bigint
> >                 outputColumnNames: _col0, _col1, _col2, _col3
> >                 File Output Operator
> >                   compressed: false
> >                   GlobalTableId: 0
> >                   table:
> >                       input format: org.apache.hadoop.mapred.TextInputFormat
> >                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> >
> >   Stage: Stage-4
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         t_o:t1:t2:dm_fact_buyer_prd_info_d
> >           TableScan
> >             alias: dm_fact_buyer_prd_info_d
> >             Filter Operator
> >               predicate:
> >                   expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0))
> >                   type: boolean
> >               Filter Operator
> >                 predicate:
> >                     expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426)) and (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id) > UDFToDouble(0))) and (gmv_trade_num > 0))
> >                     type: boolean
> >                 Select Operator
> >                   expressions:
> >                         expr: category_id
> >                         type: string
> >                         expr: user_id
> >                         type: int
> >                   outputColumnNames: category_id, user_id
> >                   Group By Operator
> >                     keys:
> >                           expr: category_id
> >                           type: string
> >                           expr: user_id
> >                           type: int
> >                     mode: hash
> >                     outputColumnNames: _col0, _col1
> >                     Reduce Output Operator
> >                       key expressions:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       sort order: ++
> >                       Map-reduce partition columns:
> >                             expr: _col0
> >                             type: string
> >                             expr: _col1
> >                             type: int
> >                       tag: -1
> >       Reduce Operator Tree:
> >         Group By Operator
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >                 expr: KEY._col1
> >                 type: int
> >           mode: mergepartial
> >           outputColumnNames: _col0, _col1
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: int
> >             outputColumnNames: _col0, _col1
> >             File Output Operator
> >               compressed: true
> >               GlobalTableId: 0
> >               table:
> >                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> >                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> >
> >   Stage: Stage-0
> >     Fetch Operator
> >       limit: -1
> > --------------------------------
> >
> > Well, I tried disabling predicate pushdown (set hive.optimize.ppd=false)
> > and the error was gone; I tried disabling map-side aggregation, and the
> > error was gone, too.
> > Does anybody know what the problem is? Please give me some advice.
> >
> > --
> > Best Regards,
> > Ted Xu
>
> --
> Yours,
> Zheng
> http://www.linkedin.com/in/zshao

--
Best Regards,
Ted Xu
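
For reference, the two workarounds mentioned in this thread can be sketched as follows. This is only an illustration, not a fix for the underlying bug: the alias 't_join' is an arbitrary replacement name, and the elided subqueries ( ... ) are the same as in the original query.

--------------------------------
-- Workaround 1: rename the join-result alias so it no longer
-- collides with the inner subquery alias 't1'.
explain select 20100426, category_id1, category_id2, assoc_idx
from (
  select category_id1, category_id2, count(distinct user_id) as assoc_idx
  from (
    select t1.category_id as category_id1
         , t2.category_id as category_id2
         , t1.user_id
    from ( ... ) t1           -- inner subquery unchanged
    join ( ... ) t2 on t1.user_id = t2.user_id
  ) t_join                    -- was 't1'; renaming makes the bug disappear
  group by category_id1, category_id2 ) t_o
where category_id1 <> category_id2
  and assoc_idx > 2;

-- Workaround 2: disable either optimization before running the query.
set hive.optimize.ppd=false;  -- predicate pushdown off
set hive.map.aggr=false;      -- map-side aggregation off
--------------------------------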
