That makes sense.
In https://issues.apache.org/jira/browse/OPTIQ-369, I have implemented
EnumerableSemiJoinRel. An example (from misc.oq):
select * from "hr"."emps"
where exists (
select 1 from "hr"."depts" where "depts"."deptno" = "emps"."deptno");
+-------+--------+-----------+---------+------------+
| empid | deptno | name | salary | commission |
+-------+--------+-----------+---------+------------+
| 100 | 10 | Bill | 10000.0 | 1000 |
| 110 | 10 | Theodore | 11500.0 | 250 |
| 150 | 10 | Sebastian | 7000.0 | |
+-------+--------+-----------+---------+------------+
(3 rows)
!ok
EnumerableSemiJoinRel(condition=[=($1, $5)], joinType=[inner])
EnumerableTableAccessRel(table=[[hr, emps]])
EnumerableCalcRel(expr#0..3=[{inputs}], expr#4=[true], $f01=[$t0], $f0=[$t4])
EnumerableJoinRel(condition=[=($0, $1)], joinType=[inner])
EnumerableAggregateRel(group=[{0}])
EnumerableCalcRel(expr#0..4=[{inputs}], $f0=[$t1])
EnumerableTableAccessRel(table=[[hr, emps]])
EnumerableTableAccessRel(table=[[hr, depts]])
The generated code uses the new Enumerables.semiJoin function, which builds a
HashSet of the keys the “inner” side (depts.deptno) then probes using a stream
from the “outer” side (emps).
The other use of “depts” in the query plan is as a “value generator”. As
https://issues.apache.org/jira/browse/OPTIQ-375 notes, the value generator is
often not necessary.
Vladimir,
Can you please log a JIRA case for your proposed “reverse” implementation of
semi-join where you build the non-semi-join side? It would help if you add a
cost function that would help us decide which implementation to use.
Julian
On Aug 11, 2014, at 1:48 PM, Vladimir Sitnikov <[email protected]>
wrote:
>> Filter(Scan(emp), gender = ‘F’) is the “build” side, and you can imagine
> implementing it by populating a hash-table with distinct keys before
> starting to read from the “dept” table.
>
> There is more than one way to skin a hash semi-join.
>
> Imagine the following physical plan:
> 1) Scan(dept) first and build Map<deptno, List<dept.row>>
> 2) Perform Filter(Scan(emp), gender = ‘F’) and check if the join key is in
> memory. If found, remove the entry from the map and return rows. If not
> found, just ignore the row.
>
> This can be more efficient in case dept is smaller than Filter(Scan(emp),
> gender = ‘F’).
>
> I suggest the following:
> 1) Let the right side of the semi join be the one that is semi-joined (as
> in your example above)
> 2) "build" side (in your terms above) is more of a physical plan, thus it
> is irrelevant at the logical level (see below). For example, for enumerable
> convention we could have EnumerableHashJoinSemi (builds table over left
> relation) and EnumerableHashJoinSemiRight (builds table over right
> relation).
>
> Vladimir