Exactly, that's the point. RAND() isn't pure. But Calcite doesn't know
which functions are pure. I debugged the `AggChecker`, for function calls
it only checks the operands - RAND() has none, so it assumes it's pure. But
a function in general can depend on data other than its arguments. However
from functions included in Calcite maybe only RAND is non-pure.

I hit this because in my case, our user ordered the results using a complex
expression. After looking at the plan I saw it evaluated this expression
for each row after table scan, even though DISTINCT greatly reduced the
number of rows. If the complex order-by expression was evaluated after
duplicate removal, the query would be faster, even though in my case the
result set was correct. In my case evaluating the order by expressions
after aggregation would improve performance.

I also tried this query:
SELECT DISTINCT empno, deptno + 1
FROM emp
ORDER BY deptno + empno + 1

It fails with: "Expression 'DEPTNO' is not in the select clause". So `ORDER
BY deptno + 1 + empno` works, but `ORDER BY deptno + emptno + 1` doesn't,
even though it's an equivalent expression. The code in AggChecker doesn't
handle all cases.

Viliam

On Tue, 3 May 2022 at 22:33, Julian Hyde <[email protected]> wrote:

> I think that Postgres is a little stricter than necessary. As long as
> the operations are pure, the ORDER BY clause of a SELECT DISTINCT
> query should allow any expression that is functionally dependent on
> the expressions in the SELECT clause. In particular,
>
>   SELECT DISTINCT x, y
>   FROM t
>   ORDER BY x + y + 1
>
> should be valid because "x + y + 1" is functionally dependent.
>
> A pure parameterless function, such as CURRENT_DATE, is trivially
> functionally dependent. Therefore
>
>   SELECT DISTINCT x
>   FROM t
>   ORDER BY CURRENT_DATE + x
>
> should also be valid.
>
> But RAND() isn't pure. There is an argument to ban non-pure and
> non-deterministic expressions from the ORDER BY clause of a SELECT
> DISTINCT query. But there's also an argument to allow them and ensure
> that they are evaluated once per row. In which case the plan would be
>
>   LogicalProject(DEPTNO=[$0])
>     LogicalSort(sort0=[$1], dir0=[ASC])
>       LogicalProject(DEPTNO=[$0], EXPR$1=[RAND()])
>         LogicalAggregate(group=[{7}])
>           LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>
> This plan is like the first plan, except that the LogicalProject
> occurs after LogicalAggregate rather than before.
>
> Julian
>
> On Tue, May 3, 2022 at 7:32 AM Viliam Durina
> <[email protected]> wrote:
> >
> > Hi all,
> >
> > I have this query:
> >
> > SELECT DISTINCT deptno
> > FROM emp
> > ORDER BY rand()
> >
> > Sql2rel conversion generates the following expression:
> >
> > LogicalProject(DEPTNO=[$0])
> >   LogicalSort(sort0=[$1], dir0=[ASC])
> >     LogicalAggregate(group=[{0, 1}])
> >       LogicalProject(DEPTNO=[$7], EXPR$1=[RAND()])
> >         LogicalTableScan(table=[[CATALOG, SALES, EMP]])
> >
> > I think it is not correct. The aggregation groups by `{deptno, RAND()}`
> > tuple, which (almost) disables the DISTINCT clause. This plan would be
> > correct:
> >
> > LogicalProject(DEPTNO=[$0])
> >   LogicalSort(sort0=[$1], dir0=[ASC])
> >     LogicalProject(DEPTNO=[$0], EXPR$0=RAND())
> >       LogicalAggregate(group=[{0}])
> >         LogicalProject(DEPTNO=[$7]])
> >           LogicalTableScan(table=[[CATALOG, SALES, EMP]])
> >
> > PostgreSQL in this case throws "ERROR: for SELECT DISTINCT, ORDER BY
> > expressions must appear in select list".
> >
> > On the other hand, this query:
> >
> > SELECT DISTINCT deptno
> > FROM emp
> > ORDER BY empno
> >
> > correctly throws the following exception: "Expression 'EMPNO' is not in
> the
> > select clause". I think this is the right thing to do.
> >
> > There's also `SqlToRelConverterTest.testOrderDistinct()` test, which also
> > asserts a wrong plan IMO.
> >
> > Do you agree this is an issue?
> >
> > Viliam
> >
> > --
> > This message contains confidential information and is intended only for
> the
> > individuals named. If you are not the named addressee you should not
> > disseminate, distribute or copy this e-mail. Please notify the sender
> > immediately by e-mail if you have received this e-mail by mistake and
> > delete this e-mail from your system. E-mail transmission cannot be
> > guaranteed to be secure or error-free as information could be
> intercepted,
> > corrupted, lost, destroyed, arrive late or incomplete, or contain
> viruses.
> > The sender therefore does not accept liability for any errors or
> omissions
> > in the contents of this message, which arise as a result of e-mail
> > transmission. If verification is required, please request a hard-copy
> > version. -Hazelcast
>

-- 
This message contains confidential information and is intended only for the 
individuals named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. 
The sender therefore does not accept liability for any errors or omissions 
in the contents of this message, which arise as a result of e-mail 
transmission. If verification is required, please request a hard-copy 
version. -Hazelcast

Reply via email to