Exactly, that's the point. RAND() isn't pure. But Calcite doesn't know which functions are pure. I debugged the `AggChecker`, for function calls it only checks the operands - RAND() has none, so it assumes it's pure. But a function in general can depend on data other than its arguments. However from functions included in Calcite maybe only RAND is non-pure.
I hit this because in my case, our user ordered the results using a complex expression. After looking at the plan I saw it evaluated this expression for each row after table scan, even though DISTINCT greatly reduced the number of rows. If the complex order-by expression was evaluated after duplicate removal, the query would be faster, even though in my case the result set was correct. In my case evaluating the order by expressions after aggregation would improve performance. I also tried this query: SELECT DISTINCT empno, deptno + 1 FROM emp ORDER BY deptno + empno + 1 It fails with: "Expression 'DEPTNO' is not in the select clause". So `ORDER BY deptno + 1 + empno` works, but `ORDER BY deptno + emptno + 1` doesn't, even though it's an equivalent expression. The code in AggChecker doesn't handle all cases. Viliam On Tue, 3 May 2022 at 22:33, Julian Hyde <[email protected]> wrote: > I think that Postgres is a little stricter than necessary. As long as > the operations are pure, the ORDER BY clause of a SELECT DISTINCT > query should allow any expression that is functionally dependent on > the expressions in the SELECT clause. In particular, > > SELECT DISTINCT x, y > FROM t > ORDER BY x + y + 1 > > should be valid because "x + y + 1" is functionally dependent. > > A pure parameterless function, such as CURRENT_DATE, is trivially > functionally dependent. Therefore > > SELECT DISTINCT x > FROM t > ORDER BY CURRENT_DATE + x > > should also be valid. > > But RAND() isn't pure. There is an argument to ban non-pure and > non-deterministic expressions from the ORDER BY clause of a SELECT > DISTINCT query. But there's also an argument to allow them and ensure > that they are evaluated once per row. In which case the plan would be > > LogicalProject(DEPTNO=[$0]) > LogicalSort(sort0=[$1], dir0=[ASC]) > LogicalProject(DEPTNO=[$0], EXPR$1=[RAND()]) > LogicalAggregate(group=[{7}]) > LogicalTableScan(table=[[CATALOG, SALES, EMP]]) > > This plan is like the first plan, except that the LogicalProject > occurs after LogicalAggregate rather than before. > > Julian > > On Tue, May 3, 2022 at 7:32 AM Viliam Durina > <[email protected]> wrote: > > > > Hi all, > > > > I have this query: > > > > SELECT DISTINCT deptno > > FROM emp > > ORDER BY rand() > > > > Sql2rel conversion generates the following expression: > > > > LogicalProject(DEPTNO=[$0]) > > LogicalSort(sort0=[$1], dir0=[ASC]) > > LogicalAggregate(group=[{0, 1}]) > > LogicalProject(DEPTNO=[$7], EXPR$1=[RAND()]) > > LogicalTableScan(table=[[CATALOG, SALES, EMP]]) > > > > I think it is not correct. The aggregation groups by `{deptno, RAND()}` > > tuple, which (almost) disables the DISTINCT clause. This plan would be > > correct: > > > > LogicalProject(DEPTNO=[$0]) > > LogicalSort(sort0=[$1], dir0=[ASC]) > > LogicalProject(DEPTNO=[$0], EXPR$0=RAND()) > > LogicalAggregate(group=[{0}]) > > LogicalProject(DEPTNO=[$7]]) > > LogicalTableScan(table=[[CATALOG, SALES, EMP]]) > > > > PostgreSQL in this case throws "ERROR: for SELECT DISTINCT, ORDER BY > > expressions must appear in select list". > > > > On the other hand, this query: > > > > SELECT DISTINCT deptno > > FROM emp > > ORDER BY empno > > > > correctly throws the following exception: "Expression 'EMPNO' is not in > the > > select clause". I think this is the right thing to do. > > > > There's also `SqlToRelConverterTest.testOrderDistinct()` test, which also > > asserts a wrong plan IMO. > > > > Do you agree this is an issue? > > > > Viliam > > > > -- > > This message contains confidential information and is intended only for > the > > individuals named. If you are not the named addressee you should not > > disseminate, distribute or copy this e-mail. Please notify the sender > > immediately by e-mail if you have received this e-mail by mistake and > > delete this e-mail from your system. E-mail transmission cannot be > > guaranteed to be secure or error-free as information could be > intercepted, > > corrupted, lost, destroyed, arrive late or incomplete, or contain > viruses. > > The sender therefore does not accept liability for any errors or > omissions > > in the contents of this message, which arise as a result of e-mail > > transmission. If verification is required, please request a hard-copy > > version. -Hazelcast > -- This message contains confidential information and is intended only for the individuals named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required, please request a hard-copy version. -Hazelcast
