[
https://issues.apache.org/jira/browse/IMPALA-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836288#comment-17836288
]
David Rorke commented on IMPALA-12995:
--------------------------------------
A couple of related ideas we could investigate to allow us to buffer/write a
limited number of files at a time during partitioned inserts without requiring
the SORT:
* Use hash-based bucketing to roughly cluster rows by partition key and write
out buckets as we reach memory/buffering thresholds. If the partition count is
reasonably low, we could accumulate enough data for a given set of keys to
write out large Parquet row groups before running out of buffer space; if not,
we might still need to do some spilling to avoid writing small files, but this
would still likely be faster than a full sort (see the first sketch after this
list).
* We could do partition-based scanning from the outset for queries that will
ultimately do partitioned inserts, ensuring that a given partition is always
scanned by a single thread and that each scanner thread processes one
partition at a time. The partition keys would then be clustered from the
start, and if we could avoid disrupting that ordering during any intermediate
processing, the rows would arrive at the writer properly ordered without
requiring an additional sort. It's not clear how much the ordered scans would
impact scan performance, but it seems likely they would be less expensive than
the current sort (a second sketch follows the list).
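To make the first idea concrete, here is a minimal C++ sketch of a
hash-bucketed writer. The types (PartitionKey, Row) and the write_file
callback are hypothetical stand-ins, not Impala's actual sink classes: rows
are hashed into a fixed number of buckets by partition key, and once the total
buffered size crosses a limit, the largest bucket is flushed, writing one file
(or row group) per partition key it holds.

{code:cpp}
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Impala's internal row/partition types.
using PartitionKey = std::string;
using Row = std::string;  // serialized row payload, for illustration only

class HashBucketedWriter {
 public:
  // Callback that writes one file (or row group) for a single partition key.
  using WriteFileFn =
      std::function<void(const PartitionKey&, const std::vector<Row>&)>;

  HashBucketedWriter(size_t num_buckets, size_t buffer_limit_bytes,
                     WriteFileFn write_file)
      : buckets_(num_buckets == 0 ? 1 : num_buckets),
        buffer_limit_bytes_(buffer_limit_bytes),
        write_file_(std::move(write_file)) {}

  void Add(const PartitionKey& key, Row row) {
    size_t row_bytes = row.size();
    Bucket& b = buckets_[std::hash<PartitionKey>{}(key) % buckets_.size()];
    b.rows_by_key[key].push_back(std::move(row));
    b.bytes += row_bytes;
    buffered_bytes_ += row_bytes;
    // Once the memory threshold is reached, flush the largest bucket(s) so
    // that the files / row groups we write stay as large as possible.
    while (buffered_bytes_ > buffer_limit_bytes_) FlushLargestBucket();
  }

  void Close() {
    for (Bucket& b : buckets_) FlushBucket(b);
  }

 private:
  struct Bucket {
    std::unordered_map<PartitionKey, std::vector<Row>> rows_by_key;
    size_t bytes = 0;
  };

  void FlushLargestBucket() {
    Bucket* largest = &buckets_[0];
    for (Bucket& b : buckets_) {
      if (b.bytes > largest->bytes) largest = &b;
    }
    FlushBucket(*largest);
  }

  void FlushBucket(Bucket& b) {
    for (auto& [key, rows] : b.rows_by_key) write_file_(key, rows);
    buffered_bytes_ -= b.bytes;
    b.rows_by_key.clear();
    b.bytes = 0;
  }

  std::vector<Bucket> buckets_;
  size_t buffer_limit_bytes_;
  WriteFileFn write_file_;
  size_t buffered_bytes_ = 0;
};
{code}

Flushing the largest bucket first keeps the smaller partitions buffered
longer, so they have a better chance of reaching a full row group before they
are written out.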
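And a rough sketch of the second idea, again with made-up types standing in
for Impala's scan ranges and scheduler: every scan range of a given partition
is assigned to the same scanner thread, and each thread works through its
partitions one at a time, so the rows it produces are already clustered by
partition key when they reach the writer.

{code:cpp}
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; Impala's real scheduler works on scan ranges and
// fragment instances rather than these simplified types.
using PartitionKey = std::string;
struct ScanRange { std::string file_path; };

// Group scan ranges by partition and deal whole partitions out to threads;
// each thread then processes its partitions one after the other.
std::vector<std::vector<std::pair<PartitionKey, std::vector<ScanRange>>>>
AssignPartitionsToThreads(
    const std::vector<std::pair<PartitionKey, ScanRange>>& ranges,
    size_t num_threads) {
  if (num_threads == 0) num_threads = 1;

  // All scan ranges of one partition end up in the same group.
  std::unordered_map<PartitionKey, std::vector<ScanRange>> by_partition;
  for (const auto& [key, range] : ranges) by_partition[key].push_back(range);

  // Round-robin whole partitions across the scanner threads.
  std::vector<std::vector<std::pair<PartitionKey, std::vector<ScanRange>>>>
      per_thread(num_threads);
  size_t next_thread = 0;
  for (auto& [key, part_ranges] : by_partition) {
    per_thread[next_thread].emplace_back(key, std::move(part_ranges));
    next_thread = (next_thread + 1) % num_threads;
  }
  return per_thread;
}
{code}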
> Don't always sort data for Iceberg partitioned inserts
> ------------------------------------------------------
>
> Key: IMPALA-12995
> URL: https://issues.apache.org/jira/browse/IMPALA-12995
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg
>
> Currently we always do a SHUFFLE and SORT before inserting data into a
> partitioned Iceberg table.
> With SHUFFLE, we can guarantee that each partition is assigned to a single
> sink operator, hence we can minimize the number of files being created.
> With SORT, we can write partitions one after the other, therefore we only
> need to write one file at a time (and only buffer data for that one file).
> This way we can avoid out-of-memory situations in the sink.
> SORT does a total ordering of the incoming records, therefore it can be very
> expensive. And when SORT needs to spill to disk, its fragment cannot receive
> incoming records, blocking the execution of the whole query cluster-wide
> (because after some time every sender is blocked on the receiving SORT
> fragment).
> In many cases the SORT is not necessary because there are only a few
> partitions assigned to each sink, especially in large clusters with high
> MT_DOP.
> During planning, Impala should decide whether to add the SORT node for such
> partitioned INSERTs. It should also respect the optimizer hints
> CLUSTERED/NOCLUSTERED.
> https://impala.apache.org/docs/build/html/topics/impala_hints.html
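As a rough illustration of the planning decision described in the issue above,
here is a minimal sketch. The real decision lives in the Java frontend; the
names, threshold parameter, and hint handling below are assumptions, not
Impala's actual API. The idea: keep the SORT when the estimated number of
partitions per sink is large, skip it when it is small, and let an explicit
CLUSTERED/NOCLUSTERED hint override the estimate.

{code:cpp}
#include <cstdint>
#include <optional>

// Hypothetical hint enum mirroring the CLUSTERED/NOCLUSTERED insert hints.
enum class InsertHint { kClustered, kNoClustered };

// Returns true if the planner should add a SORT node before the table sink.
bool ShouldAddSortForPartitionedInsert(
    int64_t estimated_partitions, int num_sink_instances,
    int64_t max_buffered_partitions_per_sink,
    std::optional<InsertHint> hint) {
  if (hint == InsertHint::kClustered) return true;     // user asked for the sort
  if (hint == InsertHint::kNoClustered) return false;  // user opted out of it
  if (num_sink_instances <= 0) return true;            // be conservative
  // Partitions expected to land on each sink instance (rounded up).
  int64_t partitions_per_sink =
      (estimated_partitions + num_sink_instances - 1) / num_sink_instances;
  // Only sort when a sink would otherwise have to buffer too many partitions.
  return partitions_per_sink > max_buffered_partitions_per_sink;
}
{code}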