Re: BUG at optimizer or map side aggregate?

Zheng Shao Wed, 12 May 2010 01:13:20 -0700

Nice finding! That's likely to be the cause.
Can you open a JIRA issue on issues.apache.org/jira/browse/HIVE


Zheng

On Wed, May 12, 2010 at 1:05 AM, Ted Xu <[email protected]> wrote:
> Zheng,
> Thank you for your reply.
> Well, it seems hard for me to repreduce this bug in a simpler query.
> However, if I change the alias of subquery 't1' (either the inner one or the
> join result), the bug disappears. I'm wondering if there is possible that
> table aliases of different level will conflict when their alias names are
> the same.
>
> 2010/5/12 Zheng Shao <[email protected]>
>>
>> Yes that does seem to be a bug.
>>
>> Can you try if you can simply the query while reproducing the bug?
>> That will make it a lot easier to debug and fix.
>>
>>
>> Zheng
>>
>> On Tue, May 11, 2010 at 7:44 PM, Ted Xu <[email protected]> wrote:
>> > Hi all,
>> > I think I found a bug, I'm not sure whether the problem is at optimizer
>> > (PPD) or at map side aggregate.
>> > See query listed below:
>> > -------------------------------------
>> > create table if not exists dm_fact_buyer_prd_info_d (
>> > category_id string
>> > ,gmv_trade_num  int
>> > ,user_id    int
>> > )
>> > PARTITIONED BY (ds int);
>> > set hive.optimize.ppd=true;
>> > set hive.map.aggr=true;
>> > explain select 20100426, category_id1,category_id2,assoc_idx
>> > from (
>> > select
>> > category_id1
>> > , category_id2
>> > , count(distinct user_id) as assoc_idx
>> > from (
>> > select
>> > t1.category_id as category_id1
>> > , t2.category_id as category_id2
>> > , t1.user_id
>> > from (
>> > select category_id, user_id
>> > from dm_fact_buyer_prd_info_d
>> > where ds <= 20100426
>> > and ds > 20100419
>> > and category_id  >0
>> > and gmv_trade_num>0
>> > group by category_id, user_id ) t1
>> > join (
>> > select category_id, user_id
>> > from dm_fact_buyer_prd_info_d
>> > where ds <= 20100426
>> > and ds >20100419
>> > and category_id >0
>> > and gmv_trade_num >0
>> > group by category_id, user_id ) t2 on t1.user_id=t2.user_id
>> > ) t1
>> > group by category_id1, category_id2 ) t_o
>> > where category_id1 <> category_id2
>> > and assoc_idx > 2;
>> > --------------------------------
>> > The query above will fail when execute, throwing exception: "can not
>> > cast
>> > UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
>> > I explained the query and the execute plan looks really wired (see
>> > the highlighted predicate):
>> > --------------------------------
>> > ABSTRACT SYNTAX TREE:
>> >   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY
>> > (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_SUBQUERY (TOK_QUERY (TOK_FROM
>> > (TOK_TABREF dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION
>> > (TOK_DIR
>> > TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id))
>> > (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and (and (and (<=
>> > (TOK_TABLE_OR_COL ds) 20100426) (> (TOK_TABLE_OR_COL ds) 20100419)) (>
>> > (TOK_TABLE_OR_COL category_id) 0)) (> (TOK_TABLE_OR_COL gmv_trade_num)
>> > 0)))
>> > (TOK_GROUPBY (TOK_TABLE_OR_COL category_id) (TOK_TABLE_OR_COL
>> > user_id))))
>> > t1) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF
>> > dm_fact_buyer_prd_info_d)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
>> > TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL category_id))
>> > (TOK_SELEXPR (TOK_TABLE_OR_COL user_id))) (TOK_WHERE (and (and (and (<=
>> > (TOK_TABLE_OR_COL ds) 20100426) (> (TOK_TABLE_OR_COL ds) 20100419)) (>
>> > (TOK_TABLE_OR_COL category_id) 0)) (> (TOK_TABLE_OR_COL gmv_trade_num)
>> > 0)))
>> > (TOK_GROUPBY (TOK_TABLE_OR_COL category_id) (TOK_TABLE_OR_COL
>> > user_id))))
>> > t2) (= (. (TOK_TABLE_OR_COL t1) user_id) (. (TOK_TABLE_OR_COL t2)
>> > user_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>> > (TOK_SELECT
>> > (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) category_id) category_id1)
>> > (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) category_id) category_id2)
>> > (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) user_id))))) t1)) (TOK_INSERT
>> > (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
>> > (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR (TOK_TABLE_OR_COL
>> > category_id2)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL
>> > user_id)) assoc_idx)) (TOK_GROUPBY (TOK_TABLE_OR_COL category_id1)
>> > (TOK_TABLE_OR_COL category_id2)))) t_o)) (TOK_INSERT (TOK_DESTINATION
>> > (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 20100426) (TOK_SELEXPR
>> > (TOK_TABLE_OR_COL category_id1)) (TOK_SELEXPR (TOK_TABLE_OR_COL
>> > category_id2)) (TOK_SELEXPR (TOK_TABLE_OR_COL assoc_idx))) (TOK_WHERE
>> > (and
>> > (<> (TOK_TABLE_OR_COL category_id1) (TOK_TABLE_OR_COL category_id2)) (>
>> > (TOK_TABLE_OR_COL assoc_idx) 2)))))
>> >
>> >
>> > STAGE DEPENDENCIES:
>> >   Stage-1 is a root stage
>> >   Stage-2 depends on stages: Stage-1, Stage-4
>> >   Stage-3 depends on stages: Stage-2
>> >   Stage-4 is a root stage
>> >   Stage-2 depends on stages: Stage-1, Stage-4
>> >   Stage-3 depends on stages: Stage-2
>> >   Stage-0 is a root stage
>> >
>> >
>> > STAGE PLANS:
>> >   Stage: Stage-1
>> >     Map Reduce
>> >       Alias -> Map Operator Tree:
>> >         t_o:t1:t1:dm_fact_buyer_prd_info_d
>> >           TableScan
>> >             alias: dm_fact_buyer_prd_info_d
>> >             Filter Operator
>> >               predicate:
>> >                   expr: (((((UDFToDouble(ds) <= UDFToDouble(20100426))
>> > and
>> > (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id)
>> > >
>> > UDFToDouble(0))) and (gmv_trade_num > 0)) and (category_id <> user_id))
>> >                   type: boolean
>> >               Filter Operator
>> >                 predicate:
>> >                     expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426))
>> > and
>> > (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id)
>> > >
>> > UDFToDouble(0))) and (gmv_trade_num > 0))
>> >                     type: boolean
>> >                 Select Operator
>> >                   expressions:
>> >                         expr: category_id
>> >                         type: string
>> >                         expr: user_id
>> >                         type: int
>> >                   outputColumnNames: category_id, user_id
>> >                   Group By Operator
>> >                     keys:
>> >                           expr: category_id
>> >                           type: string
>> >                           expr: user_id
>> >                           type: int
>> >                     mode: hash
>> >                     outputColumnNames: _col0, _col1
>> >                     Reduce Output Operator
>> >                       key expressions:
>> >                             expr: _col0
>> >                             type: string
>> >                             expr: _col1
>> >                             type: int
>> >                       sort order: ++
>> >                       Map-reduce partition columns:
>> >                             expr: _col0
>> >                             type: string
>> >                             expr: _col1
>> >                             type: int
>> >                       tag: -1
>> >       Reduce Operator Tree:
>> >         Group By Operator
>> >           keys:
>> >                 expr: KEY._col0
>> >                 type: string
>> >                 expr: KEY._col1
>> >                 type: int
>> >           mode: mergepartial
>> >           outputColumnNames: _col0, _col1
>> >           Select Operator
>> >             expressions:
>> >                   expr: _col0
>> >                   type: string
>> >                   expr: _col1
>> >                   type: int
>> >             outputColumnNames: _col0, _col1
>> >             File Output Operator
>> >               compressed: true
>> >               GlobalTableId: 0
>> >               table:
>> >                   input format:
>> > org.apache.hadoop.mapred.SequenceFileInputFormat
>> >                   output format:
>> > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>> >
>> >
>> >   Stage: Stage-2
>> >     Map Reduce
>> >       Alias -> Map Operator Tree:
>> >         $INTNAME
>> >             Reduce Output Operator
>> >               key expressions:
>> >                     expr: _col1
>> >                     type: int
>> >               sort order: +
>> >               Map-reduce partition columns:
>> >                     expr: _col1
>> >                     type: int
>> >               tag: 0
>> >               value expressions:
>> >                     expr: _col0
>> >                     type: string
>> >                     expr: _col1
>> >                     type: int
>> >         $INTNAME1
>> >             Reduce Output Operator
>> >               key expressions:
>> >                     expr: _col1
>> >                     type: int
>> >               sort order: +
>> >               Map-reduce partition columns:
>> >                     expr: _col1
>> >                     type: int
>> >               tag: 1
>> >               value expressions:
>> >                     expr: _col0
>> >                     type: string
>> >       Reduce Operator Tree:
>> >         Join Operator
>> >           condition map:
>> >                Inner Join 0 to 1
>> >           condition expressions:
>> >             0 {VALUE._col0} {VALUE._col1}
>> >             1 {VALUE._col0}
>> >           outputColumnNames: _col0, _col1, _col2
>> >           Select Operator
>> >             expressions:
>> >                   expr: _col0
>> >                   type: string
>> >                   expr: _col2
>> >                   type: string
>> >                   expr: _col1
>> >                   type: int
>> >             outputColumnNames: _col0, _col1, _col2
>> >             Select Operator
>> >               expressions:
>> >                     expr: _col0
>> >                     type: string
>> >                     expr: _col1
>> >                     type: string
>> >                     expr: _col2
>> >                     type: int
>> >               outputColumnNames: _col0, _col1, _col2
>> >               Group By Operator
>> >                 aggregations:
>> >                       expr: count(DISTINCT _col2)
>> >                 keys:
>> >                       expr: _col0
>> >                       type: string
>> >                       expr: _col1
>> >                       type: string
>> >                       expr: _col2
>> >                       type: int
>> >                 mode: hash
>> >                 outputColumnNames: _col0, _col1, _col2, _col3
>> >                 File Output Operator
>> >                   compressed: true
>> >                   GlobalTableId: 0
>> >                   table:
>> >                       input format:
>> > org.apache.hadoop.mapred.SequenceFileInputFormat
>> >                       output format:
>> > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>> >
>> >
>> >   Stage: Stage-3
>> >     Map Reduce
>> >       Alias -> Map Operator Tree:
>> >         file:/group/tbdev/shaojie/scratch/420686432\10003
>> >             Reduce Output Operator
>> >               key expressions:
>> >                     expr: _col0
>> >                     type: string
>> >                     expr: _col1
>> >                     type: string
>> >                     expr: _col2
>> >                     type: int
>> >               sort order: +++
>> >               Map-reduce partition columns:
>> >                     expr: _col0
>> >                     type: string
>> >                     expr: _col1
>> >                     type: string
>> >               tag: -1
>> >               value expressions:
>> >                     expr: _col3
>> >                     type: bigint
>> >       Reduce Operator Tree:
>> >         Group By Operator
>> >           aggregations:
>> >                 expr: count(DISTINCT KEY._col2)
>> >           keys:
>> >                 expr: KEY._col0
>> >                 type: string
>> >                 expr: KEY._col1
>> >                 type: string
>> >           mode: mergepartial
>> >           outputColumnNames: _col0, _col1, _col2
>> >           Select Operator
>> >             expressions:
>> >                   expr: _col0
>> >                   type: string
>> >                   expr: _col1
>> >                   type: string
>> >                   expr: _col2
>> >                   type: bigint
>> >             outputColumnNames: _col0, _col1, _col2
>> >             Filter Operator
>> >               predicate:
>> >                   expr: ((_col0 <> _col1) and (UDFToDouble(_col2) >
>> > UDFToDouble(2)))
>> >                   type: boolean
>> >               Select Operator
>> >                 expressions:
>> >                       expr: 20100426
>> >                       type: int
>> >                       expr: _col0
>> >                       type: string
>> >                       expr: _col1
>> >                       type: string
>> >                       expr: _col2
>> >                       type: bigint
>> >                 outputColumnNames: _col0, _col1, _col2, _col3
>> >                 File Output Operator
>> >                   compressed: false
>> >                   GlobalTableId: 0
>> >                   table:
>> >                       input format:
>> > org.apache.hadoop.mapred.TextInputFormat
>> >                       output format:
>> > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>> >
>> >
>> >   Stage: Stage-4
>> >     Map Reduce
>> >       Alias -> Map Operator Tree:
>> >         t_o:t1:t2:dm_fact_buyer_prd_info_d
>> >           TableScan
>> >             alias: dm_fact_buyer_prd_info_d
>> >             Filter Operator
>> >               predicate:
>> >                   expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426))
>> > and
>> > (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id)
>> > >
>> > UDFToDouble(0))) and (gmv_trade_num > 0))
>> >                   type: boolean
>> >               Filter Operator
>> >                 predicate:
>> >                     expr: ((((UDFToDouble(ds) <= UDFToDouble(20100426))
>> > and
>> > (UDFToDouble(ds) > UDFToDouble(20100419))) and (UDFToDouble(category_id)
>> > >
>> > UDFToDouble(0))) and (gmv_trade_num > 0))
>> >                     type: boolean
>> >                 Select Operator
>> >                   expressions:
>> >                         expr: category_id
>> >                         type: string
>> >                         expr: user_id
>> >                         type: int
>> >                   outputColumnNames: category_id, user_id
>> >                   Group By Operator
>> >                     keys:
>> >                           expr: category_id
>> >                           type: string
>> >                           expr: user_id
>> >                           type: int
>> >                     mode: hash
>> >                     outputColumnNames: _col0, _col1
>> >                     Reduce Output Operator
>> >                       key expressions:
>> >                             expr: _col0
>> >                             type: string
>> >                             expr: _col1
>> >                             type: int
>> >                       sort order: ++
>> >                       Map-reduce partition columns:
>> >                             expr: _col0
>> >                             type: string
>> >                             expr: _col1
>> >                             type: int
>> >                       tag: -1
>> >       Reduce Operator Tree:
>> >         Group By Operator
>> >           keys:
>> >                 expr: KEY._col0
>> >                 type: string
>> >                 expr: KEY._col1
>> >                 type: int
>> >           mode: mergepartial
>> >           outputColumnNames: _col0, _col1
>> >           Select Operator
>> >             expressions:
>> >                   expr: _col0
>> >                   type: string
>> >                   expr: _col1
>> >                   type: int
>> >             outputColumnNames: _col0, _col1
>> >             File Output Operator
>> >               compressed: true
>> >               GlobalTableId: 0
>> >               table:
>> >                   input format:
>> > org.apache.hadoop.mapred.SequenceFileInputFormat
>> >                   output format:
>> > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>> >
>> >
>> >   Stage: Stage-0
>> >     Fetch Operator
>> >       limit: -1
>> > --------------------------------
>> >
>> > Well, I tried disabling predicate push down (set
>> > hive.optimize.ppd=true),
>> > the error is gone; I tried disabling map side aggregate, the error is
>> > gone,too.
>> > Anybody knows what the problem is? Please give me some advice.
>> >
>> > --
>> > Best Regards,
>> > Ted Xu
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>> http://www.linkedin.com/in/zshao
>
>
>
> --
> Best Regards,
> Ted Xu
>



-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao

Re: BUG at optimizer or map side aggregate?

Reply via email to