Vineet, In case you forgot, can you please log that JIRA case? If we have a lengthy design discussion without creating an action item, we are wasting everyone’s time.
Julian > On Nov 1, 2016, at 11:00 AM, Julian Hyde <[email protected]> wrote: > > Alexander & Vineet, > > One further comment about “NOT IN”. SQL in general is fairly close to > relational algebra, but “NOT IN” is one of the places where the gap is > widest. “NOT IN” is difficult in general to execute efficiently, because of > the problem of NULL values (at Oracle, we always recommended to users to > rewrite as NOT EXISTS if possible). The gap between SQL and relational > algebra is apparent when a short SQL query becomes a complex RelNode tree. > > There is a silver lining: the RelNode tree, being relational algebra, has > well-behaved semantics. Once you’re in RelNode land, you can freely apply > transformation rules to make it efficient. > > Vineet, > > If the planner rules produce a plan that is sub-optimal I wouldn’t call it a > bug but a missing feature. (It would be a bug if the planner over-reached and > created a plan that gave wrong results, so I always tend to be conservative > about adding rules.) > > Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few > redundant projects and filters are no problem, and can be easily removed > later with rules that reduce constants and propagate predicates throughout > the tree. But for the general case of NOT IN, we have to add a self-join to > deal with the possibility that the key has NULL values. After constant > reduction has kicked in and we have realized that NULL key values are not > possible, it is not easy to remove that self-join. > > Here is a very simple query where this happens: > > sqlline> !connect > jdbc:calcite:model=core/src/test/resources/hsqldb-model.json sa "" > sqlline> !set outputformat csv > sqlline> explain plan for select * from scott.emp where deptno not in ( >> select deptno from scott.dept where deptno = 20); > 'PLAN' > 'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], > expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS > NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14, > $t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], > proj#0..7=[{exprs}], $condition=[$t21]) > EnumerableJoin(condition=[=($7, $10)], joinType=[left]) > EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4], > MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0], > ck=[$t1]) > EnumerableJoin(condition=[true], joinType=[inner]) > JdbcToEnumerableConverter > JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)]) > JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)]) > JdbcTableScan(table=[[SCOTT, DEPT]]) > JdbcToEnumerableConverter > JdbcTableScan(table=[[SCOTT, EMP]]) > JdbcToEnumerableConverter > JdbcAggregate(group=[{0, 1}]) > JdbcProject(DEPTNO=[$0], i=[true]) > JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)]) > JdbcTableScan(table=[[SCOTT, DEPT]]) > ' > 1 row selected (0.067 seconds) > > Note that there are two scans of DEPT, but one is sufficient because DEPTNO > is never null. In the JdbcAggregate, c always equals ck, and therefore the > CASE can be simplified, and therefore the scan of DEPT that produces c and ck > can be dropped, but Calcite rules cannot deduce that fact. > > Can you please log a JIRA case for this? See if you can find some other > queries (maybe using IN rather than NOT IN, or whose key columns are not so > obviously NOT NULL) and include these in the JIRA case also. > > I doubt we can fix using a planner rule. The best solution may be to use > RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add > the join. > > Julian > > >> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <[email protected]> >> wrote: >> >> Julian, thank you for help. >> >> I had a wrong picture of NULL values processing. So, it looks like there is >> some problem in my planner rules. >> As for the AST, I was confused by the wrong Flink "explain()" function >> description :) >> >> >> Regards, >> Alexander >> >> -----Original Message----- >> From: Julian Hyde [mailto:[email protected]] >> Sent: Monday, October 31, 2016 10:43 PM >> To: [email protected] >> Subject: Re: Problems with abstract syntax tree >> >> The behavior of NOT IN in SQL is complicated when there are NULL values >> around. In particular, if one "word" value from the sub-query is null, then >> the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' >> null)" would evaluate to UNKNOWN for every row.) >> >> It is valid to deduce that "word" in the sub-query is never null, because of >> the "WHERE word = 'hello'" condition. I would have hoped that a constant >> reduction could do that, and then maybe the CASE expression can be >> simplified. >> >> By the way, to be pedantic, what we are talking about here is the RelNode >> tree, the relational algebra, which comes out of the SqlToRelConverter. The >> AST is the SqlNode tree, which comes out of the parser and goes into the >> SqlToRelConverter. >> >> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin >> <[email protected]> wrote: >>> Hello, everybody. >>> >>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. >>> Can you help me to understand is there a problem in Calcite or just in >>> wrong settings passed to Calcite functions? >>> >>> I have a simple table "Words" with one column named "word" and a query with >>> NOT IN operator: >>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM >>> Words WHERE word = 'hello')" >>> >>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() >>> and then transformed to a relational tree by >>> org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...). >>> >>> As a result I see the following abstract syntax tree >>> LogicalProject(word=[$0]) >>> LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, >>> IS NULL($3), null, <($2, $1), null, false))]) >>> LogicalJoin(condition=[=($3, $4)], joinType=[left]) >>> LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0]) >>> LogicalJoin(condition=[true], joinType=[inner]) >>> EnumerableTableScan(table=[[Words]]) >>> LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)]) >>> LogicalProject($f0=[$0], $f1=[true]) >>> LogicalProject(word=[$0]) >>> LogicalFilter(condition=[=($0, 'hello')]) >>> EnumerableTableScan(table=[[Words]]) >>> LogicalAggregate(group=[{0}], agg#0=[MIN($1)]) >>> LogicalProject($f0=[$0], $f1=[true]) >>> LogicalProject(word=[$0]) >>> LogicalFilter(condition=[=($0, 'hello')]) >>> EnumerableTableScan(table=[[Words]]) >>> >>> which fails later during query plan optimization (while calling >>> org.apache.calcite.tools.Programs.RuleSetProgram.run()). >>> >>> I think it might be because of a very complex abstract syntax tree >>> generated by Calcite. Shouldn't it be more simple? This one looks good for >>> me: >>> LogicalProject(word=[$0]) >>> LogicalFilter(condition=[IS NULL($2)]) >>> LogicalJoin(condition=[=($0, $1)], joinType=[left]) >>> EnumerableTableScan(table=[[Words]]) >>> LogicalProject($f0=[$0], $f1=[true]) >>> LogicalProject(word=[$0]) >>> LogicalFilter(condition=[=($0, 'hello')]) >>> EnumerableTableScan(table=[[Words]]) >>> >>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree >>> - the optimization works fine. And the query execution result is the same >>> as must be in case of using NOT IN. So am I wrong with a supposition about >>> bad abstract syntax tree or not? I will be glad to receive any comments. >>> >>> Regards, >>> Alexander >
