[ https://issues.apache.org/jira/browse/IMPALA-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-9875:
----------------------------------
Description:
For left semi and anti joins with only equi-join predicates, we don't need to
store duplicates in the hash table, because a probe row will always match the
first build row. We could rework the build process in PhjBuilder so that it
builds the hash table on the fly and avoids insertion into the
BufferedTupleStream if there is a match in the hash table. I.e. the build
process would be closer to GroupingAggregator.
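
To make the idea concrete, here is a minimal sketch in Python, not Impala's
actual C++ code. A plain set stands in for the hash table and a list stands in
for the BufferedTupleStream. The function names and the key_of callback are
hypothetical, and the sketch assumes non-NULL join keys and ignores spilling
and partitioning entirely.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]
KeyFn = Callable[[Row], Hashable]


def build_distinct(build_rows: Iterable[Row], key_of: KeyFn) -> Tuple[Set[Hashable], List[Row]]:
    """Build the key set on the fly, keeping only the first row per join key."""
    seen: Set[Hashable] = set()   # stand-in for the hash table
    stream: List[Row] = []        # stand-in for the BufferedTupleStream
    for row in build_rows:
        key = key_of(row)
        if key in seen:
            # A row with this key is already present, so a semi or anti
            # join can never need this duplicate: skip the insertion.
            continue
        seen.add(key)
        stream.append(row)
    return seen, stream


def left_semi_join(probe_rows: Iterable[Row], keys: Set[Hashable], key_of: KeyFn) -> List[Row]:
    """Return each probe row whose key has at least one build-side match."""
    return [row for row in probe_rows if key_of(row) in keys]


def left_anti_join(probe_rows: Iterable[Row], keys: Set[Hashable], key_of: KeyFn) -> List[Row]:
    """Return each probe row whose key has no build-side match."""
    return [row for row in probe_rows if key_of(row) not in keys]
```

With build rows (1, 'a'), (1, 'b'), (2, 'c') keyed on the first column, only
(1, 'a') and (2, 'c') are stored, and the join results are the same as if all
three rows had been kept.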
An alternative approach to building the hash tables on the fly would be to use
a bloom filter to track which rows are already present in the hash table.
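
The issue does not spell out how the filter would be used, so the following
Python sketch shows just one possible reading. A Bloom filter can report false
positives but never false negatives, so a "not present" answer proves the key
is new, while a "maybe present" answer still needs an exact check before a row
can be dropped. The BloomFilter class, its sizing parameters and the
dedup_with_filter helper are all hypothetical, and a dict stands in for the
hash table.

```python
import hashlib
from typing import Callable, Dict, Hashable, Iterable, Iterator, Tuple

Row = Tuple[Hashable, ...]


class BloomFilter:
    """Tiny Bloom filter; the size and hash count are arbitrary sketch values."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: Hashable) -> Iterator[int]:
        data = repr(key).encode("utf-8")
        for seed in range(self.num_hashes):
            # A different salt per seed gives independent-looking hashes.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key: Hashable) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key: Hashable) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def dedup_with_filter(build_rows: Iterable[Row], key_of: Callable[[Row], Hashable]) -> Dict[Hashable, Row]:
    """Keep the first row per key, consulting the filter before the table."""
    bloom = BloomFilter()
    table: Dict[Hashable, Row] = {}   # stand-in for the hash table
    for row in build_rows:
        key = key_of(row)
        if not bloom.might_contain(key):
            # Definitely a new key: insert without probing the table.
            bloom.add(key)
            table[key] = row
        elif key not in table:
            # False positive from the filter: the key is new after all.
            bloom.add(key)
            table[key] = row
        # Otherwise the key is a true duplicate and the row is dropped.
    return table
```

The exact check in the elif branch is what keeps the result correct when the
filter reports a false positive; without it, rows with unseen keys could be
dropped by mistake.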
Some other joins, such as the one in IMPALA-1706, also have distinct
semantics, so this might apply there too and help avoid exploding joins.
was:
For left semi and anti joins with only equi-join predicates, we don't need to
store duplicates in the hash table, because a probe row will always match the
first build row. We could rework the build process in PhjBuilder so that it
builds the hash table on the fly and avoids insertion into the
BufferedTupleStream if there is a match in the hash table. I.e. the build
process would be closer to GroupingAggregator.
Some other joins like that in IMPALA-1706 also have distinct semantics, so
maybe this could be applied there too to avoid exploding joins.
> Deduplicate build in joins with distinct semantics
> --------------------------------------------------
>
> Key: IMPALA-9875
> URL: https://issues.apache.org/jira/browse/IMPALA-9875
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Priority: Major