Re: Outer-join operator integration with DataSet API (FLINK-2576)

Till Rohrmann Tue, 01 Sep 2015 09:35:45 -0700

Hi Johann,

I'd prefer 1.c, because the different join variants are semantically
different and this should be IMO reflected in the API. Moreover, the
`JoinHints` are used to give hints for the selection of the underlying
strategy for the different join variants. For `leftOuterJoin` you could
either use a sort-merge-join or a hash-join where the left side is the
probe side. For the `rightOuterJoin` you will have different strategies:
Either you use a sort-merge-join or a hash-join where the right side is the
probe side. Thus, the `JoinHints` for both variants cannot be the same.


Since I'm not too familiar with the code base of the join operator
implementation, I cannot give you a really good advice for question 2. In
general it would be good to avoid as much redundant code as possible since
it makes the code base harder to maintain. But if this should entail a
disproportional larger work effort, then I think it's also fine to follow
the CoGroup way.

Cheers,
Till

On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <[email protected]> wrote:

> Hi all,
>
> we (Ricky and I) are currently working on the outer join
> implementation for Flink (FLINK-687, previous pull requests #907,
> #1052).
>
> I am now looking for advice on 2 issues specifically regarding the
> integration of the outer join operator with the DataSet API
> (FLINK-2576).
>
> 1. There are several options of exposing the operator to the user via
> the DataSet API and I'd just like to hear your preferences between the
> following options (or other suggestions if I missed something):
>   a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
> [i.e. asking the user to pass an enum left-, right-, or full outer
> join]
>   b. DataSet#join(DataSet other, JoinType joinType)  [i.e. like option
> a, but generalized to work for all: inner-, left-, right-, full outer
> joins]
>   c. DataSet#left/right/fullOuterJoin(DataSet other)  [i.e. a fully
> qualified method for each operator]
>
> Personally I'm partial towards options a and c, although a does have
> the advantage of not blowing up the API too much (imagine adding
> additional optional parameters, such as JoinHint, to each of option
> c's methods).
>
> 2. I would have liked to implement the outer join operator API by
> reusing as much code & functionality as possible from
> org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase
> (especially all the KeySelector, semantic annotations, and tuple
> unwrapping stuff...) but I feel like this would bite me sooner or
> later due to incompatibilities or other minor differences between the
> behaviour of those operators.
> I imagine this is the reason why lots of this functionality was
> duplicated for the CoGroup operator implementation. Which makes me
> think I should probably go the same route and duplicate the necessary
> APIs, and then maybe try to refactor later?
> Any opinions or hints regarding this?
>
> Thanks in advance,
> Johann
>

Re: Outer-join operator integration with DataSet API (FLINK-2576)

Reply via email to