Yes, I would like to continue my work with this issue. It is already assigned to me. I will follow the approach that you suggested.
Thanks, Alexander > -----Original Message----- > From: Fabian Hueske [mailto:fhue...@gmail.com] > Sent: Friday, November 11, 2016 11:49 AM > To: dev@flink.apache.org > Subject: Re: [FLINK-4541] Support for SQL NOT IN operator > > Hi Alexander, > > I had a look at the logical plan generated by Calcite for NOT IN again > and noticed that the CROSS JOIN is only used to join a single record > (the result of a global aggregation) to all records of the other table. > IMO, this is a special case, that we can easily support and efficiently > implement. > > For that I would implement a translation rule, that converts a Cross Join > (join(true)) only into a DataSetSingleRowCross if one input of the join > is a global aggregation, i.e., a LogicalAggregate without groupset / > groupsets. > The DataSetSingleRowCross would be implemented as MapFunction > with BroadcastSet input (for the single record input) and internally > call a code generated JoinFunction. > > If you want to follow that approach, please add a summary of our > discussion to the JIRA issue and assign it to yourself (if that doesn't work, > please let me know and I'll give you contributor permissions). > > Thanks, > Fabian > > 2016-11-11 9:28 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>: > > > Hi Fabian, > > > > Should we close this issue then? Or I could just leave the comment why > > we can't repair NOT IN at the moment. So no one else will do the same > > research again. Perhaps the Calcite team will change a logical plan > > for NOT IN and we will be back to this issue. > > > > Regards, > > Alexander > > > > -----Original Message----- > > From: Fabian Hueske [mailto:fhue...@gmail.com] > > Sent: Friday, November 11, 2016 2:04 AM > > To: dev@flink.apache.org > > Subject: Re: [FLINK-4541] Support for SQL NOT IN operator > > > > > Hi Alexander, > > > > > > Thanks for looking into this issue! > > > > > > We did not support CROSS JOIN on purpose because the general case is > > > very expensive to compute. > > > Also as you noticed we would have to make sure that inner-joins are > > > preferred over cross joins (if possible). > > > Cost-based optimizers (such as Calcite's Volcano Planner) use cost > > > estimates for that. > > > So in principle, the approach of assigning high costs was correct. > > > In > > > the case of Flink, > > > we have to be careful though, because we do not have good estimates > > > for > > > the size of > > > our inputs (actually we do not have any at the moment...). So even > > > tweaked cost functions might bit us. > > > > > > I would actually prefer to not add support for NOT IN (except for > > > the > > > simple case of a list of a few literals) at this point. > > > Adding a CrossJoin translation rule will also add support for > > > arbitrary > > > cross joins, which I would like to avoid. > > > With support for cross joins it is possible to write by accident > > > queries > > > that run for days and produce > > > vast amounts of data filling up all disks. > > > > > > I also think there are other issues, which are more important to > > > address > > > and would have a bigger impact than support for NOT IN. > > > > > > Best, Fabian > > > > > > 2016-11-10 11:26 GMT+01:00 Alexander Shoshin > > ><alexander_shos...@epam.com > > >: > > > > > > > Hi, > > > > > > > > I am working on FLINK-4541 issue and this is my current changes: > > > > https://github.com/apache/flink/compare/master... > > > > AlexanderShoshin:FLINK-4541. > > > > > > > > I found that NOT IN does not work with nested queries because of > > > > missed DataSet planner rule for a cross join. After adding > > > > DataSetCrossJoinRule several tests from > > > > org.apache.flink.api.scala.batch.ExplainTest > > > > (testJoinWithExtended and testJoinWithoutExtended) that should > > > > check execution plans became fail because VolcanoPlanner started > > > > to build new execution plans for them using DataSetCrossJoin. > > > > That's why I increased DataSetCrossJoin cost (in > > > > computeSelfCost(...) function) to avoid its usage if it is possible. > > > > But it seems to be not a good idea. > > > > > > > > Do you have any ideas how to solve a problem with > > > > testJoinWithExtended and testJoinWithoutExtended tests in another > > > > way? Is it correct that these tests check an execution plan instead of > > > > a query result? > > > > > > > > Regards, > > > > Alexander > > > > > > > >