[ 
https://issues.apache.org/jira/browse/IMPALA-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-4969:
-------------------------------------

    Assignee:     (was: Alexander Behm)

> With clustered hint, consider sort->exchange->insert plan
> ---------------------------------------------------------
>
>                 Key: IMPALA-4969
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4969
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.9.0
>            Reporter: Alan Choi
>            Priority: Major
>
> I noticed that with the clustered hint, we do the SORT right before the
> insert, but after the exchange (when shuffling).
> For a simple ETL transformation (insert into tbl select * from src_tbl), the
> number of hosts doing the write is going to be less than or equal to the
> number of hosts doing the scan. So, by doing the sort after the exchange,
> there's a risk of losing parallelism.
> Using TPC-DS as an example, the Impala TPC-DS toolkit requires an ETL step
> where we re-partition the fact table according to the sales date. Sales date
> is skewed: some dates have a lot more data than others. Also, there are only
> 4k sales dates, so the data size might not even out across the whole cluster.
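
To make the parallelism concern concrete, here is a rough sketch of how many rows each host would have to sort under the two orderings. It is only an illustration, not Impala's planner logic: the function name, the host count, the row counts, and the use of `date_id % num_hosts` as a stand-in for hash partitioning are all assumptions made up for this example.

```python
def rows_sorted_per_host(rows_per_date, num_hosts):
    """Return (pre_exchange, post_exchange): rows each host must sort.

    rows_per_date maps a hypothetical integer date id to its row count.
    """
    total = sum(rows_per_date.values())

    # Sort before the exchange: assume the scan spreads rows evenly,
    # so every scan host sorts roughly total / num_hosts rows.
    base, extra = divmod(total, num_hosts)
    pre = [base + (1 if h < extra else 0) for h in range(num_hosts)]

    # Sort after the exchange: all rows of one date land on one host.
    # The modulo is an assumed stand-in for hash partitioning.
    post = [0] * num_hosts
    for date_id, n in rows_per_date.items():
        post[date_id % num_hosts] += n
    return pre, post


# Made-up skew: one date holds 1000 rows, the other 15 hold 100 each.
rows_per_date = {d: 100 for d in range(16)}
rows_per_date[0] = 1000

pre, post = rows_sorted_per_host(rows_per_date, num_hosts=8)
print("busiest host, sort before exchange:", max(pre))
print("busiest host, sort after exchange: ", max(post))
```

With these invented numbers the busiest host sorts far more rows when the sort runs after the exchange, which is the skew effect described above. The sketch ignores everything else a real plan would have to account for, such as how sorted runs are combined on the writer side.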



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

