RE: [FLINK-4541] Support for SQL NOT IN operator

Alexander Shoshin Fri, 11 Nov 2016 01:43:39 -0800

Yes, I would like to continue my work on this issue. It is already assigned
to me. I will follow the approach that you suggested.


Thanks,
Alexander

> -----Original Message-----
> From: Fabian Hueske [mailto:fhue...@gmail.com] 
> Sent: Friday, November 11, 2016 11:49 AM
> To: dev@flink.apache.org
> Subject: Re: [FLINK-4541] Support for SQL NOT IN operator
>
> Hi Alexander,
>
> I had a look at the logical plan generated by Calcite for NOT IN again
> and noticed that the CROSS JOIN is only used to join a single record
> (the result of a global aggregation) to all records of the other table.
> IMO, this is a special case that we can easily support and implement
> efficiently.
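
To make the single-record join concrete, here is a minimal plain-Python model (not Calcite or Flink code) of evaluating `a NOT IN (SELECT b FROM t)`. It assumes the global aggregation computes two counts, the total number of rows and the number of non-NULL values, which SQL's three-valued logic needs. The names `global_counts` and `not_in` are hypothetical, and the exact plan Calcite generates may differ in detail.

```python
def global_counts(values):
    """Global aggregation without grouping: always yields exactly one row."""
    total = len(values)
    non_null = sum(1 for v in values if v is not None)
    return (total, non_null)


def not_in(left_rows, sub_values):
    """Keep the rows of left_rows whose value is NOT IN sub_values (SQL semantics)."""
    single_row = global_counts(sub_values)   # the one record that gets cross joined
    present = {v for v in sub_values if v is not None}
    result = []
    for a in left_rows:
        total, non_null = single_row         # every left row sees the same record
        if total == 0:
            result.append(a)                 # empty subquery: NOT IN is TRUE
        elif a is None:
            continue                         # NULL NOT IN (non-empty) is UNKNOWN
        elif a in present:
            continue                         # a match makes NOT IN FALSE
        elif non_null < total:
            continue                         # subquery contains a NULL: UNKNOWN
        else:
            result.append(a)
    return result
```

The point of the sketch is that the aggregate side always has exactly one row, so joining it to every record of the other table is cheap, unlike a general cross join.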
>
> For that I would implement a translation rule that converts a Cross Join
> (join(true)) into a DataSetSingleRowCross only if one input of the join
> is a global aggregation, i.e., a LogicalAggregate without groupset /
> groupsets.
> The DataSetSingleRowCross would be implemented as a MapFunction
> with a BroadcastSet input (for the single-record input) that internally
> calls a code-generated JoinFunction.
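
A rough sketch of the two pieces described above, written as a plain-Python model rather than against the real Calcite or Flink APIs. The dictionary-based plan nodes, `is_single_row_cross`, and `SingleRowCrossMap` are all hypothetical stand-ins. In Flink's DataSet API the single record would presumably be attached with `withBroadcastSet` and read in the function's `open()` method.

```python
def is_single_row_cross(join_condition, left, right):
    """Rule check: a join on `true` where one input is a global aggregate."""
    def is_global_aggregate(node):
        # A LogicalAggregate with an empty group set produces a single row.
        return node.get("type") == "LogicalAggregate" and not node.get("group_set")

    return join_condition is True and (
        is_global_aggregate(left) or is_global_aggregate(right)
    )


class SingleRowCrossMap:
    """Models a map function that receives the single record as a broadcast value."""

    def __init__(self, join_function):
        self.join_function = join_function   # stands in for the generated JoinFunction
        self.single_row = None

    def open(self, broadcast_set):
        rows = list(broadcast_set)
        if len(rows) != 1:
            raise ValueError("expected exactly one broadcast row, got %d" % len(rows))
        self.single_row = rows[0]

    def map(self, record):
        # Each record of the large input is joined with the same single row.
        return self.join_function(record, self.single_row)


mapper = SingleRowCrossMap(lambda record, agg: record + agg)
mapper.open([(3, 3)])                          # the single aggregate record
out = [mapper.map(r) for r in [(1,), (2,)]]    # [(1, 3, 3), (2, 3, 3)]
```

Because the rule only matches when one side is a global aggregate, arbitrary cross joins stay unsupported.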
>
> If you want to follow that approach, please add a summary of our
> discussion to the JIRA issue and assign it to yourself (if that doesn't work,
> please let me know and I'll give you contributor permissions).
>
> Thanks,
> Fabian
>
> 2016-11-11 9:28 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>:
>
> > Hi Fabian,
> >
> > Should we close this issue then? Or I could just leave a comment explaining
> > why we can't fix NOT IN at the moment, so that no one else repeats the same
> > research. Perhaps the Calcite team will change the logical plan
> > for NOT IN, and then we can come back to this issue.
> >
> > Regards,
> > Alexander
> >
> > -----Original Message-----
> > From: Fabian Hueske [mailto:fhue...@gmail.com]
> > Sent: Friday, November 11, 2016 2:04 AM
> > To: dev@flink.apache.org
> > Subject: Re: [FLINK-4541] Support for SQL NOT IN operator
> >
> > > Hi Alexander,
> > >
> > > Thanks for looking into this issue!
> > >
> > > We did not support CROSS JOIN on purpose because the general case is
> > > very expensive to compute.
> > > Also as you noticed we would have to make sure that inner-joins are
> > > preferred over cross joins (if possible).
> > > Cost-based optimizers (such as Calcite's Volcano Planner) use cost
> > > estimates for that.
> > > So in principle, the approach of assigning high costs was correct.
> > > In the case of Flink, we have to be careful though, because we do not
> > > have good estimates for the size of our inputs (actually we do not
> > > have any at the moment...). So even tweaked cost functions might
> > > bite us.
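
To illustrate the concern about missing size estimates, here is a toy cost comparison in plain Python. It is not Calcite's or Flink's actual cost model: the cost formulas, the fallback `DEFAULT_ROW_COUNT`, the `penalty` parameter, and the table names are all assumptions made for the example.

```python
DEFAULT_ROW_COUNT = 1000  # hypothetical fallback when no statistics are available


def estimated_rows(stats, table):
    """Return the known row count, or the fallback if the table has no statistics."""
    return stats.get(table, DEFAULT_ROW_COUNT)


def cross_join_cost(stats, left, right, penalty=1):
    """Toy cost: a cross join touches every pair of rows; `penalty` is a manual tweak."""
    return penalty * estimated_rows(stats, left) * estimated_rows(stats, right)


def inner_join_cost(stats, left, right):
    """Toy cost: a hash join reads each input once."""
    return estimated_rows(stats, left) + estimated_rows(stats, right)


def choose_plan(stats, left, right, penalty=1):
    """Pick the candidate with the lowest estimated cost, as a cost-based planner would."""
    candidates = {
        "inner_join": inner_join_cost(stats, left, right),
        "cross_join": cross_join_cost(stats, left, right, penalty),
    }
    return min(candidates, key=candidates.get)


no_stats = {}
real_stats = {"orders": 5_000_000, "customers": 200_000}

# Without statistics the estimate is built purely from the fallback value,
# so it says nothing about the real inputs.
guess = cross_join_cost(no_stats, "orders", "customers")
actual = cross_join_cost(real_stats, "orders", "customers")
```

In this model `guess` and `actual` differ by six orders of magnitude, so any penalty tuned against the fallback estimate is tuned against numbers unrelated to the real data.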
> > >
> > > I would actually prefer to not add support for NOT IN (except for the
> > > simple case of a list of a few literals) at this point.
> > > Adding a CrossJoin translation rule will also add support for arbitrary
> > > cross joins, which I would like to avoid.
> > > With support for cross joins it is possible to accidentally write
> > > queries that run for days and produce vast amounts of data, filling
> > > up all disks.
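
A quick back-of-the-envelope calculation shows how fast an unrestricted cross join grows. The input sizes, row width, and throughput below are made-up figures chosen only for illustration.

```python
def cross_join_output(left_rows, right_rows, bytes_per_row):
    """Return (row count, size in bytes) of an unrestricted cross join."""
    rows = left_rows * right_rows
    return rows, rows * bytes_per_row


def days_to_emit(rows, rows_per_second):
    """Time needed just to emit the result at a fixed throughput."""
    return rows / rows_per_second / 86_400


# Two inputs of one million rows each, 100 bytes per joined row.
rows, size_bytes = cross_join_output(1_000_000, 1_000_000, 100)
terabytes = size_bytes / 10**12           # 100.0
days = days_to_emit(rows, 1_000_000)      # about 11.6
```

Under these assumptions two modest inputs already produce 10^12 rows, about 100 TB of output and more than eleven days of runtime.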
> > >
> > > I also think there are other issues which are more important to
> > > address and would have a bigger impact than support for NOT IN.
> > >
> > > Best, Fabian
> > >
> > > 2016-11-10 11:26 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>:
> > >
> > > > Hi,
> > > >
> > > > I am working on the FLINK-4541 issue and these are my current changes:
> > > > https://github.com/apache/flink/compare/master...AlexanderShoshin:FLINK-4541
> > > >
> > > > I found that NOT IN does not work with nested queries because a
> > > > DataSet planner rule for a cross join is missing. After I added
> > > > DataSetCrossJoinRule, several tests from
> > > > org.apache.flink.api.scala.batch.ExplainTest
> > > > (testJoinWithExtended and testJoinWithoutExtended), which check
> > > > execution plans, started to fail because VolcanoPlanner began
> > > > to build new execution plans for them using DataSetCrossJoin.
> > > > That's why I increased the DataSetCrossJoin cost (in the
> > > > computeSelfCost(...) function) to avoid its usage where possible.
> > > > But this does not seem to be a good idea.
> > > >
> > > > Do you have any ideas on how to solve the problem with the
> > > > testJoinWithExtended and testJoinWithoutExtended tests in another
> > > > way? Is it correct that these tests check an execution plan instead of
> > > > a query result?
> > > >
> > > > Regards,
> > > > Alexander
> > > >
> > > >
