Hi all,

we (Ricky and I) are currently working on the outer join implementation for Flink (FLINK-687, previous pull requests #907, #1052).
I am now looking for advice on two issues, both regarding the integration of the outer join operator with the DataSet API (FLINK-2576).

1. There are several options for exposing the operator to the user via the DataSet API, and I'd like to hear your preferences among the following (or other suggestions if I missed something):

   a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
      [i.e. asking the user to pass an enum: left, right, or full outer join]

   b. DataSet#join(DataSet other, JoinType joinType)
      [i.e. like option a, but generalized to cover all joins: inner, left, right, and full outer]

   c. DataSet#left/right/fullOuterJoin(DataSet other)
      [i.e. a fully qualified method for each operator]

   Personally I'm partial towards options a and c, although a does have the advantage of not blowing up the API too much (imagine adding additional optional parameters, such as JoinHint, to each of option c's methods).

2. I would have liked to implement the outer join operator API by reusing as much code and functionality as possible from org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase (especially the KeySelector handling, semantic annotations, and tuple unwrapping), but I feel this would bite me sooner or later due to incompatibilities or other minor differences between the behaviour of those operators. I imagine this is why much of this functionality was duplicated for the CoGroup operator implementation. That makes me think I should probably go the same route and duplicate the necessary APIs, and then maybe try to refactor later.

   Any opinions or hints regarding this?

Thanks in advance,
Johann