You’re right that Bloom filters are useful. I was just exploring what could be done at the logical level. When it comes to implementing the semi-join, Bloom filters are a good option if you can accept an approximate answer (they can report false positives, but never false negatives).
Here’s a scenario where it would make sense to transform JoinRel(X, Y) -> JoinRel(SemiJoinRel(X, Y), Y). Let’s suppose that Y has a large number of rows and columns (i.e. the average row length is large). We can ship the set of distinct Y key values to X, semi-join them, then send the filtered X rows to Y. So, SemiJoin(X, Y) has significantly lower I/O cost than Join(X, Y) even though it reads the same number of rows from X and Y, because it reads fewer columns from Y. We’ve replaced one shuffle join with two map joins.

Julian
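
P.S. Here is a rough sketch of that rewrite in plain Python, just to make the three steps concrete. It is not how a planner or execution engine would implement it: the tables are in-memory lists of dicts, and the names `semi_join`, `hash_join`, the key column `k`, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def semi_join(x_rows, y_keys, key="k"):
    """Keep only the X rows whose key appears in the set of Y keys."""
    return [row for row in x_rows if row[key] in y_keys]


def hash_join(x_rows, y_rows, key="k"):
    """Inner join: build a hash index on X, then probe it with each Y row."""
    index = defaultdict(list)
    for x in x_rows:
        index[x[key]].append(x)
    return [{**x, **y} for y in y_rows for x in index.get(y[key], [])]


# Hypothetical data: X is narrow, Y stands in for the wide table.
X = [{"k": 1, "a": "x1"}, {"k": 2, "a": "x2"}, {"k": 3, "a": "x3"}]
Y = [
    {"k": 2, "payload": "wide row 2"},
    {"k": 2, "payload": "wide row 2b"},
    {"k": 4, "payload": "wide row 4"},
]

# Step 1: read only the key column of Y and ship the distinct values to X.
y_keys = {y["k"] for y in Y}

# Step 2: SemiJoin(X, Y) filters X without touching Y's other columns.
x_filtered = semi_join(X, y_keys)

# Step 3: send the filtered X rows to Y and do the real join there.
result = hash_join(x_filtered, Y)

# The rewritten plan gives the same rows as joining X and Y directly.
direct = hash_join(X, Y)
```

If `y_keys` were a Bloom filter rather than an exact set, step 2 could let through some X rows that have no match in Y, but step 3 would discard them, so the final join result would still be exact.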
