Csaba Ringhofer created IMPALA-13225:
----------------------------------------
Summary: Tuple deduplication does not work in partitioned exchanges
Key: IMPALA-13225
URL: https://issues.apache.org/jira/browse/IMPALA-13225
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Csaba Ringhofer
RowBatch::Serialize() has deduplication logic that detects duplicate tuples
(usually the result of joins) by comparing tuple pointers. This doesn't work in
partitioned exchanges, because rows are deep-copied one by one when they are
collected for a given channel, so all tuple pointers end up distinct:
https://github.com/apache/impala/blob/d83b48cf72fa94ec7f6e55da409b4dff3350543b/be/src/runtime/krpc-data-stream-sender.cc#L645
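
To illustrate the problem, here is a minimal, self-contained sketch. It is not
Impala's code: Tuple, CountSerializedTuples() and DeepCopyRows() are
hypothetical stand-ins, and the pointer set is a simplification of whatever
RowBatch::Serialize() actually does. It only shows why identity-based
deduplication stops finding duplicates once every row has been copied
individually.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a tuple; not Impala's Tuple class.
struct Tuple {
  int value;
};

// Counts how many tuples would be serialized if duplicates are detected
// purely by pointer identity (simplified model, assumed for illustration).
std::size_t CountSerializedTuples(const std::vector<const Tuple*>& rows) {
  std::unordered_set<const Tuple*> seen;
  std::size_t serialized = 0;
  for (const Tuple* t : rows) {
    // insert() reports whether this pointer was seen for the first time.
    if (seen.insert(t).second) ++serialized;
  }
  return serialized;
}

// Mimics a per-row deep copy: every row gets its own freshly allocated
// tuple, even if several input rows pointed at the same tuple.
std::vector<const Tuple*> DeepCopyRows(
    const std::vector<const Tuple*>& rows,
    std::vector<std::unique_ptr<Tuple>>* pool) {
  assert(pool != nullptr);
  std::vector<const Tuple*> copied;
  copied.reserve(rows.size());
  for (const Tuple* t : rows) {
    pool->push_back(std::make_unique<Tuple>(*t));
    copied.push_back(pool->back().get());
  }
  return copied;
}
```

In this model, three rows that share one tuple are counted once before the
copy and three times after it, although the tuple contents are identical.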
The deduplication was added a long time ago (it doesn't have a Jira):
https://gerrit.cloudera.org/#/c/573/
I am not sure whether it ever worked in the partitioned case (it should work
in broadcast exchanges, though).