[
https://issues.apache.org/jira/browse/IMPALA-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong reassigned IMPALA-4969:
-------------------------------------
Assignee: (was: Alexander Behm)
> With clustered hint, consider sort->exchange->insert plan
> ---------------------------------------------------------
>
> Key: IMPALA-4969
> URL: https://issues.apache.org/jira/browse/IMPALA-4969
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.9.0
> Reporter: Alan Choi
> Priority: Major
>
> I noticed that with the clustered hint, we do the SORT right before the
> insert, but after the exchange (when shuffling).
> For a simple ETL transformation (insert into tbl select * from src_tbl), the
> number of hosts doing the write is going to be less than or equal to the
> number of hosts doing the scan. So, by doing the sort after the exchange,
> there's a risk of losing parallelism.
> Using TPC-DS as an example, the Impala TPC-DS toolkit requires an ETL step
> where we re-partition the fact table according to the sales date. Sales date
> is skewed: some dates have a lot more data than others. Also, there are only
> 4k sales dates. The data size might not even out across the whole cluster.
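
The parallelism concern in the description can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not Impala's planner logic: the row counts are made up, the scan is assumed to be spread evenly across hosts, and `key % num_hosts` stands in for whatever hash the exchange actually uses.

```python
def rows_per_writer(rows_per_key, num_hosts):
    """Route every row of a partition key to one host, as a shuffle on that key would.

    Assumption: key % num_hosts is a stand-in for the real hash function.
    """
    load = [0] * num_hosts
    for key, rows in rows_per_key.items():
        load[key % num_hosts] += rows
    return load


def max_sort_input(rows_per_key, num_hosts, sort_before_exchange):
    """Largest number of rows any single host has to sort."""
    total = sum(rows_per_key.values())
    if sort_before_exchange:
        # Assumption: scan ranges are balanced, so each host sorts an even share.
        return -(-total // num_hosts)  # ceiling division
    # Sort after the exchange: each host sorts whatever keys were routed to it.
    return max(rows_per_writer(rows_per_key, num_hosts))


# Hypothetical skewed data: 8 sales dates, one of them 10x larger than the rest.
rows = {date_key: 100 for date_key in range(8)}
rows[0] = 1000
hosts = 4

before = max_sort_input(rows, hosts, sort_before_exchange=True)
after = max_sort_input(rows, hosts, sort_before_exchange=False)
print(before, after)
```

In this toy setup the busiest host sorts far more rows when the sort runs after the exchange, because the skewed key lands entirely on one host. The model only looks at the sort input size; it says nothing about the cost of the writes themselves.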