[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163762#comment-16163762
 ] 

liyunzhang_intel commented on HIVE-17474:
-----------------------------------------

[~lirui] , [~xuefuz]: after debugging in tez, found the part2 join part1 is 
common merge join(CommonMergeJoinOperator).
{code}
  Reducer 2 
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col7 (type: string)
                  1 _col0 (type: string)

{code}


the implementation of CommonMergeJoin. Does hive on spark enable 
CommonMergeJoin?
{code}
/*
 * With an aim to consolidate the join algorithms to either hash based joins 
(MapJoinOperator) or
 * sort-merge based joins, this operator is being introduced. This operator 
executes a sort-merge
 * based algorithm. It replaces both the JoinOperator and the 
SMBMapJoinOperator for the tez side of
 * things. It works in either the map phase or reduce phase.
 *
 * The basic algorithm is as follows:
 *
 * 1. The processOp receives a row from a "big" table.
 * 2. In order to process it, the operator does a fetch for rows from the other 
tables.
 * 3. Once we have a set of rows from the other tables (till we hit a new key), 
more rows are
 *    brought in from the big table and a join is performed.
 */
{code}

> Poor Performance about subquery like DS/query70 on HoS
> ------------------------------------------------------
>
>                 Key: HIVE-17474
>                 URL: https://issues.apache.org/jira/browse/HIVE-17474
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: explain.70.vec
>
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
>     sum(ss_net_profit) as total_sum
>    ,s_state
>    ,s_county
>    ,grouping__id as lochierarchy
>    , rank() over(partition by grouping__id, case when grouping__id == 2 then 
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
>     store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
>     join store s on s.s_store_sk  = ss.ss_store_sk
>  where
>     d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>              ( select s_state
>                from  (select s_state as s_state, sum(ss_net_profit),
>                              rank() over ( partition by s_state order by 
> sum(ss_net_profit) desc) as ranking
>                       from   store_sales, store, date_dim
>                       where  d_month_seq between 1193 and 1193+11
>                             and date_dim.d_date_sk = 
> store_sales.ss_sold_date_sk
>                             and store.s_store_sk  = store_sales.ss_store_sk
>                       group by s_state
>                      ) tmp1 
>                where ranking <= 5
>              )
>  group by s_state,s_county with rollup
> order by
>    lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query,
> part1: it calculates the sub-query and get the result of the state which 
> ss_net_profit is less than 5.
> part2: big table store_sales join small tables date_dim, store and get the 
> result.
> part3: part1 join part2
> the problem is on the part3, this is common join. The cardinality of part1 
> and part2 is low as there are not very different values about states( 
> actually there are 30 different values in the table store).  If use common 
> join, big data will go to the 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to