[ 
https://issues.apache.org/jira/browse/IMPALA-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boglarka Egyed updated IMPALA-12649:
------------------------------------
    Summary: Use max(data_sequence_number) for joining equality delete rows  
(was: Use max(data_sequence_number) fo joining equality delete rows)

> Use max(data_sequence_number) for joining equality delete rows
> --------------------------------------------------------------
>
>                 Key: IMPALA-12649
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12649
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>            Reporter: Gabor Kaszab
>            Priority: Major
>              Labels: impala-iceberg
>
> Improvement idea for the future:
> If Flink always writes EQ-delete files and reuses the same primary key often, 
> the HashMap will hold the same entry with multiple data sequence numbers. 
> During probing, each hash table lookup then has to loop over all of those 
> sequence numbers and check them. In fact only the largest data sequence 
> number is needed; the lower sequence numbers for the same primary key don't 
> add any value.
> So we could add an Aggregation node to the right side of the join, like "PK1, 
> PK2, ..., max(data_sequence_number), group by PK1, PK2, ...".
> We would still need to decide when to add this node to the plan and when 
> not to. We should also avoid having an EXCHANGE between the aggregation 
> node and the JOIN node: it would be redundant, since both would use the same 
> partition key expressions (the primary keys).
> If we had "hash teams" in Impala, we could always add this aggregation 
> operator, as it would be in the same "hash team" as the JOIN operator, i.e. 
> we wouldn't need to build the hash table twice. Microsoft's paper about hash 
> joins and hash teams: 
> [https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=fc1c78cbef5062cf49fdb309b1935af08b759d2d]
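
The reasoning in the description can be illustrated with a minimal Python sketch. It is not Impala code: the function names and the in-memory dict standing in for the join's hash table are hypothetical. It assumes the Iceberg rule that an equality delete applies to a data row only when the row's data sequence number is strictly lower than the delete's, and compares a build side that keeps every sequence number per primary key with one that keeps only the maximum.

```python
from collections import defaultdict

# Hypothetical stand-ins for the build side of the join; not Impala APIs.

def build_all_seqs(deletes):
    """Keep every data sequence number per primary key (current behaviour)."""
    table = defaultdict(list)
    for pk, seq in deletes:
        table[pk].append(seq)
    return table

def build_max_seq(deletes):
    """Keep only max(data_sequence_number) per primary key (proposed)."""
    table = {}
    for pk, seq in deletes:
        prev = table.get(pk)
        if prev is None or seq > prev:
            table[pk] = seq
    return table

def is_deleted_all(table, pk, data_seq):
    # Probe: loop over every stored sequence number for this key.
    # Assumption: a delete applies only to rows with a strictly lower
    # data sequence number.
    return any(data_seq < s for s in table.get(pk, ()))

def is_deleted_max(table, pk, data_seq):
    # Probe: a single comparison against the largest sequence number.
    max_seq = table.get(pk)
    return max_seq is not None and data_seq < max_seq

# Made-up equality delete entries: (primary key tuple, data sequence number).
deletes = [((1,), 3), ((1,), 7), ((2,), 5), ((1,), 4)]
all_seqs = build_all_seqs(deletes)
max_only = build_max_seq(deletes)
```

Both probes give the same answer for any data row, because a row that precedes some delete of its key also precedes the newest one. The second table stores one entry per key, which is what the proposed aggregation on the right side of the join would produce.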



