[
https://issues.apache.org/jira/browse/IMPALA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-13225:
-------------------------------------
Labels: performance (was: )
> Tuple deduplication does not work in partitioned exchanges
> ----------------------------------------------------------
>
> Key: IMPALA-13225
> URL: https://issues.apache.org/jira/browse/IMPALA-13225
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: performance
>
> RowBatch::Serialize() has deduplication logic that detects duplicate tuples
> (usually the result of joins) by comparing tuple pointers. This doesn't work
> in partitioned exchanges because every row is deep-copied one by one when
> rows are collected for a given channel, so all tuple pointers are distinct:
> https://github.com/apache/impala/blob/d83b48cf72fa94ec7f6e55da409b4dff3350543b/be/src/runtime/krpc-data-stream-sender.cc#L645
> The deduplication was added a long time ago (there is no Jira for it):
> https://gerrit.cloudera.org/#/c/573/
> I am not sure whether it ever worked in the partitioned case (it should
> work in broadcast exchanges, though).
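
The effect described above can be illustrated with a minimal, self-contained
sketch. This is not Impala's code: Tuple, CountSerializedTuples() and
DeepCopyRows() are hypothetical names invented for the illustration, and the
sketch assumes for simplicity that deduplication only compares the tuple
pointer of a row with that of the previous row.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a tuple; not Impala's Tuple class.
struct Tuple {
  int value;
};

// Counts how many tuples would be written if a tuple is skipped whenever its
// pointer equals the pointer seen in the previous row (assumed, simplified
// form of pointer-based deduplication).
std::size_t CountSerializedTuples(const std::vector<const Tuple*>& rows) {
  std::size_t written = 0;
  const Tuple* prev = nullptr;
  for (const Tuple* t : rows) {
    if (t != prev) ++written;  // same pointer as previous row: duplicate
    prev = t;
  }
  return written;
}

// Mimics a per-row deep copy: every row gets its own freshly allocated tuple,
// even when the source rows all pointed at the same tuple. The pool keeps the
// copies alive so the returned pointers stay valid.
std::vector<const Tuple*> DeepCopyRows(
    const std::vector<const Tuple*>& rows,
    std::vector<std::unique_ptr<Tuple>>* pool) {
  std::vector<const Tuple*> copied;
  copied.reserve(rows.size());
  for (const Tuple* t : rows) {
    pool->push_back(std::make_unique<Tuple>(*t));
    copied.push_back(pool->back().get());
  }
  return copied;
}
```

With three rows that all point at one tuple (as a join might produce),
CountSerializedTuples() reports 1 for the original rows and 3 for the output
of DeepCopyRows(): the values are still identical, but the pointer comparison
can no longer see that.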