Hi,
Looks like you are joining store_sales with catalog_sales on item_sk, this
kind of join condition is a many to many, which means the output number of
rows will be much larger then input number of rows, not sure if this is
intended.
Also did you run "compute stats [TABLE_NAME]" on both tables?
For a more comprehensive query try TPCDS Q17
select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,avg(ss_quantity) as store_sales_quantityave
,stddev_samp(ss_quantity) as store_sales_quantitystdev
,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
,count(sr_return_quantity) as store_returns_quantitycount
,avg(sr_return_quantity) as store_returns_quantityave
,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as
store_returns_quantitycov
,count(cs_quantity) as catalog_sales_quantitycount
,avg(cs_quantity) as catalog_sales_quantityave
,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
from store_sales
,store_returns
,catalog_sales
,date_dim d1
,date_dim d2
,date_dim d3
,store
,item
where d1.d_quarter_name = '2000Q1'
and d1.d_date_sk = ss_sold_date_sk
and i_item_sk = ss_item_sk
and s_store_sk = ss_store_sk
and ss_customer_sk = sr_customer_sk
and ss_item_sk = sr_item_sk
and ss_ticket_number = sr_ticket_number
and sr_returned_date_sk = d2.d_date_sk
and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
and sr_customer_sk = cs_bill_customer_sk
and sr_item_sk = cs_item_sk
and cs_sold_date_sk = d3.d_date_sk
and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
group by i_item_id
,i_item_desc
,s_state
order by i_item_id
,i_item_desc
,s_state
limit 100;
I recommend moving this kind of discussion on
[email protected].
On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote:
> The profile file is damaged. Here is a screenshot for exec summary
>
>
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:
>
>> Hi Devs
>>
>> I met a performance issue on big table join. The query takes more than 3
>> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
>> cluster. when running query, the left scanner and exchange node are very
>> slow. Did I miss some key arguments?
>>
>> you can see profile file in attachment.
>>
>>
>>
>> --
>> Thanks & Best Regards
>>
>
>
>
> --
> Thanks & Best Regards
>