Outer-join operator integration with DataSet API (FLINK-2576)

Johann Kovacs Tue, 01 Sep 2015 09:04:24 -0700

Hi all,

We (Ricky and I) are currently working on the outer join
implementation for Flink (FLINK-687, previous pull requests #907,
#1052).


I am now looking for advice on two issues specifically regarding the
integration of the outer join operator with the DataSet API
(FLINK-2576).

1. There are several options for exposing the operator to the user via
the DataSet API, and I'd just like to hear your preferences among the
following options (or other suggestions if I missed something):
  a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
[i.e. asking the user to pass an enum value selecting a left, right,
or full outer join]
  b. DataSet#join(DataSet other, JoinType joinType)  [i.e. like option
a, but generalized to work for all joins: inner, left, right, and
full outer]
  c. DataSet#left/right/fullOuterJoin(DataSet other)  [i.e. a fully
qualified method for each operator]

Personally, I'm partial towards options a and c, although a does have
the advantage of not blowing up the API too much (imagine adding
further optional parameters, such as JoinHint, to each of option
c's methods).
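
To make the three options concrete, here is a minimal sketch of how each
one might look from the caller's side. It is not the Flink API:
OuterJoinApiSketch, MiniDataSet, and both enums' constants are
hypothetical stand-ins, the data is a plain list of integer keys, and
the join is a naive nested loop. It only illustrates the shape of the
method signatures under discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical names, not Flink classes.
public class OuterJoinApiSketch {

    // Enum for option b: covers every join flavour, including inner.
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    // Enum for option a: outer joins only.
    enum OuterJoinType { LEFT, RIGHT, FULL }

    // Stand-in for DataSet, holding integer join keys in memory.
    static final class MiniDataSet {
        final List<Integer> keys;

        MiniDataSet(List<Integer> keys) {
            this.keys = keys;
        }

        // Option b: one generalized method, selected by JoinType.
        List<String> join(MiniDataSet other, JoinType type) {
            boolean keepLeft = type == JoinType.LEFT_OUTER || type == JoinType.FULL_OUTER;
            boolean keepRight = type == JoinType.RIGHT_OUTER || type == JoinType.FULL_OUTER;
            List<String> out = new ArrayList<>();
            for (Integer l : keys) {
                boolean matched = false;
                for (Integer r : other.keys) {
                    if (l.equals(r)) {
                        out.add("(" + l + ", " + r + ")");
                        matched = true;
                    }
                }
                // Unmatched left element: padded with null for outer joins.
                if (!matched && keepLeft) {
                    out.add("(" + l + ", null)");
                }
            }
            if (keepRight) {
                // Unmatched right elements: padded with null on the left.
                for (Integer r : other.keys) {
                    if (!keys.contains(r)) {
                        out.add("(null, " + r + ")");
                    }
                }
            }
            return out;
        }

        // Option a: a single outerJoin method taking an OuterJoinType.
        List<String> outerJoin(MiniDataSet other, OuterJoinType type) {
            switch (type) {
                case LEFT:  return join(other, JoinType.LEFT_OUTER);
                case RIGHT: return join(other, JoinType.RIGHT_OUTER);
                default:    return join(other, JoinType.FULL_OUTER);
            }
        }

        // Option c: one fully qualified method per join type.
        List<String> leftOuterJoin(MiniDataSet other) {
            return join(other, JoinType.LEFT_OUTER);
        }

        List<String> rightOuterJoin(MiniDataSet other) {
            return join(other, JoinType.RIGHT_OUTER);
        }

        List<String> fullOuterJoin(MiniDataSet other) {
            return join(other, JoinType.FULL_OUTER);
        }
    }

    public static void main(String[] args) {
        MiniDataSet left = new MiniDataSet(Arrays.asList(1, 2));
        MiniDataSet right = new MiniDataSet(Arrays.asList(2, 3));
        System.out.println(left.outerJoin(right, OuterJoinType.LEFT)); // option a
        System.out.println(left.join(right, JoinType.FULL_OUTER));     // option b
        System.out.println(left.rightOuterJoin(right));                // option c
    }
}
```

In this sketch an extra optional parameter (a JoinHint, say) would mean
one more overload for options a and b, but three more for option c,
which is the API growth mentioned above.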

2. I would have liked to implement the outer join operator API by
reusing as much code and functionality as possible from
org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase
(especially the KeySelector handling, semantic annotations, and tuple
unwrapping), but I feel this would bite me sooner or later because of
incompatibilities or other minor differences between the behaviour of
those operators.
I imagine this is why much of this functionality was duplicated for
the CoGroup operator implementation, which makes me think I should
probably go the same route: duplicate the necessary APIs now and maybe
try to refactor later.
Any opinions or hints regarding this?

Thanks in advance,
Johann
