[
https://issues.apache.org/jira/browse/IMPALA-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boglarka Egyed updated IMPALA-12649:
------------------------------------
Summary: Use max(data_sequence_number) for joining equality delete rows
(was: Use max(data_sequence_number) fo joining equality delete rows)
> Use max(data_sequence_number) for joining equality delete rows
> --------------------------------------------------------------
>
> Key: IMPALA-12649
> URL: https://issues.apache.org/jira/browse/IMPALA-12649
> Project: IMPALA
> Issue Type: Sub-task
> Components: Frontend
> Reporter: Gabor Kaszab
> Priority: Major
> Labels: impala-iceberg
>
> improvement idea for the future:
> If Flink always writes EQ-delete files, and uses the same primary key a lot,
> we will have the same entry in the HashMap with multiple data sequence
> numbers. Then during probing, for each hash table lookup we need to loop over
> all the sequence numbers and check them. Actually we only need the largest
> data sequence number, the lower sequence numbers with the same primary keys
> don't add any value.
> So we could add an Aggregation node to the right side of the join, like "PK1,
> PK2, ..., max(data_sequence_number), group by PK1, PK2, ...".
> Now, we would need to decide when to add this node to the plan, or when we
> shouldn't. We should also avoid having an EXCHANGE between the aggregation
> node and the JOIN node, as it would be redundant as they would use the same
> partition key expressions (the primary keys).
> If we had "hash teams" in Impala, we could always add this aggregator
> operator, as it would be in the same "hash team" with the JOIN operator, i.e.
> we wouldn't need to build the hash table twice. Microsoft's paper about hash
> joins and hash teams:
> [https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=fc1c78cbef5062cf49fdb309b1935af08b759d2d]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]