[
https://issues.apache.org/jira/browse/IMPALA-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Noemi Pap-Takacs updated IMPALA-12995:
--------------------------------------
Issue Type: Improvement (was: Bug)
> Don't always sort data for Iceberg partitioned inserts
> ------------------------------------------------------
>
> Key: IMPALA-12995
> URL: https://issues.apache.org/jira/browse/IMPALA-12995
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg
>
> Currently we always do a SHUFFLE and SORT before inserting data into a
> partitioned Iceberg table.
> With SHUFFLE, we can guarantee that each partition is assigned to a single
> sink operator, hence we can minimize the number of files being created.
> With SORT, we can write partitions one after the other, therefore we only
> need to write one file at a time (and only buffer data for that one file).
> This way we can avoid out of memory situations in the sink.
> SORT does a total ordering of the incoming records, therefore it can be very
> expensive. And when SORT needs to spill to disk, its fragment cannot receive
> incoming records, blocking the execution of the whole query cluster-wide
> (because after some time every sender is blocked on the receiving SORT
> fragment).
> In a lot of cases the SORT is not necessary because there's only a very few
> partitions assigned to each sink, especially in large clusters with high
> MT_DOP.
> During planning Impala should decide whether to add the SORT node for such
> partitioned INSERTs. Also, it should respect the optimizer hints
> CLUSTERED/NOCLUSTERED.
> https://impala.apache.org/docs/build/html/topics/impala_hints.html
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]