[
https://issues.apache.org/jira/browse/DRILL-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacques Nadeau updated DRILL-173:
---------------------------------
Issue Type: Bug (was: Improvement)
> Join operator should reuse ValueVectors when duplicate keys are present
> -----------------------------------------------------------------------
>
> Key: DRILL-173
> URL: https://issues.apache.org/jira/browse/DRILL-173
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.0.0-milestone-1
> Reporter: Ben Becker
> Labels: optimization
> Fix For: Future
>
>
> There are cases where joining two record batches can result in redundant
> work. Consider a merge join performed on two tables (*t1* and *t2*) with
> duplicate keys on both sides:
> h5. t1
> || key || value ||
> | 2 | 'a' |
> | 2 | 'b' |
> h5. t2
> || key || value ||
> | 2 | 'A' |
> | 2 | 'B' |
> | 2 | 'C' |
> The resulting table will contain the cross product of all rows with key value '2':
> || key || t1.value || t2.value ||
> | 2 | 'a' | 'A' |
> | 2 | 'a' | 'B' |
> | 2 | 'a' | 'C' |
> | 2 | 'b' | 'A' |
> | 2 | 'b' | 'B' |
> | 2 | 'b' | 'C' |
> The current implementation iteratively copies t2.value from the incoming
> vectors for every matching t1 row. Ideally, the t2.value vector would only
> be iteratively constructed on the first pass; after that, the values already
> built can be copied as a block.
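
The idea in the description can be sketched in plain Java. This is only an illustration under stated assumptions: String arrays stand in for ValueVectors, and the class and method names (DuplicateKeyJoinSketch, rightColumnNaive, rightColumnReuse) are hypothetical and are not part of Drill's API. The first method copies the t2 values one at a time for every t1 row; the second builds the run once and then repeats it with a bulk copy.

```java
import java.util.Arrays;

// Hypothetical sketch: String[] stands in for a ValueVector; none of these
// names come from the Drill code base.
final class DuplicateKeyJoinSketch {

    // Per-value copying: for each matching left row, every right value is
    // copied individually from the incoming run.
    static String[] rightColumnNaive(int leftRunLength, String[] rightRun) {
        String[] out = new String[leftRunLength * rightRun.length];
        int pos = 0;
        for (int l = 0; l < leftRunLength; l++) {
            for (String value : rightRun) {
                out[pos++] = value;
            }
        }
        return out;
    }

    // Reuse: construct the right-side run value by value on the first pass
    // only, then repeat it with one bulk copy per remaining left row.
    static String[] rightColumnReuse(int leftRunLength, String[] rightRun) {
        int n = rightRun.length;
        String[] out = new String[leftRunLength * n];
        if (leftRunLength == 0 || n == 0) {
            return out;
        }
        for (int i = 0; i < n; i++) {
            out[i] = rightRun[i];                     // first pass, per value
        }
        for (int l = 1; l < leftRunLength; l++) {
            System.arraycopy(out, 0, out, l * n, n);  // later passes, bulk copy
        }
        return out;
    }

    public static void main(String[] args) {
        String[] t2Values = {"A", "B", "C"};
        // Two t1 rows share key 2, so the t2 run appears twice in the output.
        System.out.println(Arrays.toString(rightColumnReuse(2, t2Values)));
        // prints [A, B, C, A, B, C]
    }
}
```

Both methods produce the same t2.value column for the example tables; the difference is only in how many per-value copies are performed once the first run has been built.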