[
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163762#comment-16163762
]
liyunzhang_intel commented on HIVE-17474:
-----------------------------------------
[~lirui] , [~xuefuz]: after debugging in tez, found the part2 join part1 is
common merge join(CommonMergeJoinOperator).
{code}
Reducer 2
Reduce Operator Tree:
Merge Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col7 (type: string)
1 _col0 (type: string)
{code}
the implementation of CommonMergeJoin. Does hive on spark enable
CommonMergeJoin?
{code}
/*
* With an aim to consolidate the join algorithms to either hash based joins
(MapJoinOperator) or
* sort-merge based joins, this operator is being introduced. This operator
executes a sort-merge
* based algorithm. It replaces both the JoinOperator and the
SMBMapJoinOperator for the tez side of
* things. It works in either the map phase or reduce phase.
*
* The basic algorithm is as follows:
*
* 1. The processOp receives a row from a "big" table.
* 2. In order to process it, the operator does a fetch for rows from the other
tables.
* 3. Once we have a set of rows from the other tables (till we hit a new key),
more rows are
* brought in from the big table and a join is performed.
*/
{code}
> Poor Performance about subquery like DS/query70 on HoS
> ------------------------------------------------------
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
> {code}
> select
> sum(ss_net_profit) as total_sum
> ,s_state
> ,s_county
> ,grouping__id as lochierarchy
> , rank() over(partition by grouping__id, case when grouping__id == 2 then
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk = ss.ss_store_sk
> where
> d1.d_month_seq between 1193 and 1193+11
> and s.s_state in
> ( select s_state
> from (select s_state as s_state, sum(ss_net_profit),
> rank() over ( partition by s_state order by
> sum(ss_net_profit) desc) as ranking
> from store_sales, store, date_dim
> where d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk =
> store_sales.ss_sold_date_sk
> and store.s_store_sk = store_sales.ss_store_sk
> group by s_state
> ) tmp1
> where ranking <= 5
> )
> group by s_state,s_county with rollup
> order by
> lochierarchy desc
> ,case when lochierarchy = 0 then s_state end
> ,rank_within_parent
> limit 100;
> {code}
> let's analyze the query,
> part1: it calculates the sub-query and get the result of the state which
> ss_net_profit is less than 5.
> part2: big table store_sales join small tables date_dim, store and get the
> result.
> part3: part1 join part2
> the problem is on the part3, this is common join. The cardinality of part1
> and part2 is low as there are not very different values about states(
> actually there are 30 different values in the table store). If use common
> join, big data will go to the 30 reducers.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)