[ 
https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17474:
------------------------------------
    Description: 
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 {code}
select  
    sum(ss_net_profit) as total_sum
   ,s_state
   ,s_county
   ,grouping__id as lochierarchy
   , rank() over(partition by grouping__id, case when grouping__id == 2 then s_state end
                 order by sum(ss_net_profit)) as rank_within_parent
from
    store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
    join store s on s.s_store_sk  = ss.ss_store_sk
 where
    d1.d_month_seq between 1193 and 1193+11
 and s.s_state in
             ( select s_state
               from  (select s_state as s_state, sum(ss_net_profit),
                              rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
                      from   store_sales, store, date_dim
                      where  d_month_seq between 1193 and 1193+11
                            and date_dim.d_date_sk = store_sales.ss_sold_date_sk
                            and store.s_store_sk  = store_sales.ss_store_sk
                      group by s_state
                     ) tmp1 
               where ranking <= 5
             )
 group by s_state,s_county with rollup
order by
   lochierarchy desc
  ,case when lochierarchy = 0 then s_state end
  ,rank_within_parent
 limit 100;
{code}
Let's analyze the query:
part1: evaluate the sub-query, which returns the states whose rank by sum(ss_net_profit) is at most 5 (ranking <= 5).
part2: join the big table store_sales with the small tables date_dim and store.
part3: join the result of part1 with the result of part2.
The problem is in part3, which is executed as a common (shuffle) join. The join key s_state has low cardinality, since there are only about 30 distinct state values in the store table, so with a common join the large volume of data from part2 is shuffled to at most 30 reducers.
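
Since part1 returns at most 5 rows, one option would be to let Hive convert the part3 common join into a map join so the shuffle to 30 reducers is avoided. The following is only a minimal sketch of the relevant settings, not something verified against this query; the size threshold is an illustrative assumption.
{code}
-- Sketch only: encourage Hive to convert the part3 common join into a map join,
-- assuming the part1 result (at most 5 states) fits in memory.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- illustrative threshold (bytes) for the total size of the small-table side
set hive.auto.convert.join.noconditionaltask.size=10000000;
{code}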

  was:
in 
[DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
 The explain plan of Hive on Spark is
{code}


{code}


> Different logical plan of same query(TPC-DS/70) with same settings
> ------------------------------------------------------------------
>
>                 Key: HIVE-17474
>                 URL: https://issues.apache.org/jira/browse/HIVE-17474
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> in 
> [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql].
>  {code}
> select  
>     sum(ss_net_profit) as total_sum
>    ,s_state
>    ,s_county
>    ,grouping__id as lochierarchy
>    , rank() over(partition by grouping__id, case when grouping__id == 2 then s_state end
>                  order by sum(ss_net_profit)) as rank_within_parent
> from
>     store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
>     join store s on s.s_store_sk  = ss.ss_store_sk
>  where
>     d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>              ( select s_state
>                from  (select s_state as s_state, sum(ss_net_profit),
>                              rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
>                       from   store_sales, store, date_dim
>                       where  d_month_seq between 1193 and 1193+11
>                             and date_dim.d_date_sk = store_sales.ss_sold_date_sk
>                             and store.s_store_sk  = store_sales.ss_store_sk
>                       group by s_state
>                      ) tmp1 
>                where ranking <= 5
>              )
>  group by s_state,s_county with rollup
> order by
>    lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  Let's analyze the query:
> part1: evaluate the sub-query, which returns the states whose rank by sum(ss_net_profit) is at most 5 (ranking <= 5).
> part2: join the big table store_sales with the small tables date_dim and store.
> part3: join the result of part1 with the result of part2.
> The problem is in part3, which is executed as a common (shuffle) join. The join key s_state has low cardinality, since there are only about 30 distinct state values in the store table, so with a common join the large volume of data from part2 is shuffled to at most 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
