[
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162687#comment-16162687
]
liyunzhang_intel commented on HIVE-17474:
-----------------------------------------
after debugging code, i found part2 join part1 is a map join in tez, this is
the difference with hive on spark.Will update the detail reason later.
> Poor Performance about subquery like DS/query70 on HoS
> ------------------------------------------------------
>
> Key: HIVE-17474
> URL: https://issues.apache.org/jira/browse/HIVE-17474
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
> Attachments: explain.70.vec
>
>
> in
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
> {code}
> select
> sum(ss_net_profit) as total_sum
> ,s_state
> ,s_county
> ,grouping__id as lochierarchy
> , rank() over(partition by grouping__id, case when grouping__id == 2 then
> s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
> store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
> join store s on s.s_store_sk = ss.ss_store_sk
> where
> d1.d_month_seq between 1193 and 1193+11
> and s.s_state in
> ( select s_state
> from (select s_state as s_state, sum(ss_net_profit),
> rank() over ( partition by s_state order by
> sum(ss_net_profit) desc) as ranking
> from store_sales, store, date_dim
> where d_month_seq between 1193 and 1193+11
> and date_dim.d_date_sk =
> store_sales.ss_sold_date_sk
> and store.s_store_sk = store_sales.ss_store_sk
> group by s_state
> ) tmp1
> where ranking <= 5
> )
> group by s_state,s_county with rollup
> order by
> lochierarchy desc
> ,case when lochierarchy = 0 then s_state end
> ,rank_within_parent
> limit 100;
> {code}
> let's analyze the query,
> part1: it calculates the sub-query and get the result of the state which
> ss_net_profit is less than 5.
> part2: big table store_sales join small tables date_dim, store and get the
> result.
> part3: part1 join part2
> the problem is on the part3, this is common join. The cardinality of part1
> and part2 is low as there are not very different values about states(
> actually there are 30 different values in the table store). If use common
> join, big data will go to the 30 reducers.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)