Re: DISTINCT ON clause

Julian Hyde Wed, 07 Mar 2018 15:17:45 -0800

I always wished there was a variant of LIMIT that operated per key. E.g. return 
the youngest 2 employees in each department:


SELECT *
FROM Amp
ORDER BY deptno LIMIT 2, birthdate DESC

(This is not standard syntax.)

DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess with SQL 
evaluation order the way DISTINCT ON does.

Also, per-key limit is something that a sort-limit physical operator could 
usefully and efficiently do.

> On Mar 4, 2018, at 8:09 PM, Aman Sinha <[email protected]> wrote:
> 
> So far, I haven't had much success.. other databases such as Oracle, SQL
> Server and DB2 don't seem to support this at least based on their published
> SQL reference guides.
> 
> One way people work around this is to use ROW_NUMBER() window function to
> achieve similar results.
> e.g
>     SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER (PARTITION BY
> a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1;
> 
> Regarding the aggregate functions allowing arbitrary values in a group, I
> am not aware of databases that do that.  I am not sure how to verify this
> though.
> 
> -Aman
> 
> On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <[email protected]> wrote:
> 
>> I will try to find out more about this during the next few days....what do
>> Oracle, SQL Server, DB2 support.
>> 
>> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <[email protected]> wrote:
>> 
>>> Can you do some research, and see if any other databases do anything
>>> similar? Since Postgres isn’t standard, maybe we can improve upon it a bit.
>>> 
>>> Also, do any databases have an aggregate function that takes an arbitrary
>>> value in the group, or asserts that all values are the same? We would use
>>> that when we translate the syntactic sugar to algebra.
>>> 
>>> Julian
>>> 
>>> 
>>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <[email protected]> wrote:
>>>> 
>>>> It is a useful functionality, especially since the SELECT list can
>>> contain
>>>> mix of primitive types and array or map type columns.
>>>> I do see your point about the Postgres semantics of ORDER BY vs DISTINCT
>>>> ON.
>>>> I don't like the hidden semantics of MySQL either.
>>>> Perhaps the compromise solution is to introduce functions such as
>>>> FIRST_ROW() aggregation function (similar to lead/lag window functions)
>>> and
>>>> use GROUP BY.
>>>> 
>>>> -Aman
>>>> 
>>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <[email protected]> wrote:
>>>> 
>>>>> I can see that it would be useful. But there are a couple of things
>>> about
>>>>> it that are messy.
>>>>> 
>>>>> I think the syntax is a bit clunky because it uses parentheses; a query
>>>>> would be difficult to read if people would like multiple columns,
>>>>> expressions, and aliases.
>>>>> 
>>>>> I also think the semantics are messy. If you read
>>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
>>> SQL-DISTINCT <
>>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
>>> SQL-DISTINCT>
>>>>> you will see that “DISTINCT ON” is evaluated after the ORDER BY clause,
>>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause. So
>>> there
>>>>> will be a bizarre interaction if DISTINCT ON is used with UNION and
>>> ORDER
>>>>> BY.
>>>>> 
>>>>> By the way, good ol’ MySQL doesn’t have this problem; you can just
>>> write
>>>>> 
>>>>> SELECT a, b, c FROM t GROUP BY a
>>>>> 
>>>>> and it just picks the first value of b and c. This “feature” is one of
>>> my
>>>>> least favorite things about MySQL, so let’s not emulate it. :)
>>>>> 
>>>>> Julian
>>>>> 
>>>>> 
>>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <[email protected]>
>>> wrote:
>>>>>> 
>>>>>> The DISTINCT 'ON' clause is not supported but I am trying to see if
>>> there
>>>>>> are other people who have run into this.  One of the use cases I have
>>>>>> intuitively maps to something like this:
>>>>>> 
>>>>>> SELECT DISTINCT *ON (a)*,  b, c  FROM T
>>>>>> 
>>>>>> Here  suppose 'a' is an INT  and b, c are some complex types such as
>>>>>> array.  In my example, the values of  b and c happen to be the same in
>>>>> all
>>>>>> rows belonging to the same group of 'a', so I just want the first row.
>>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a .   Other
>>>>>> alternative is to create a new aggregate function that picks the first
>>>>>> value but it makes the syntax verbose especially with large number of
>>>>> such
>>>>>> columns.
>>>>>> 
>>>>>> Incidentally, Postgres supports this [1]
>>>>>> 
>>>>>> [1]
>>>>>> https://www.postgresql.org/docs/current/static/sql-
>>>>> select.html#SQL-DISTINCT
>>>>>> 
>>>>>> 
>>>>>> -Aman
>>>>> 
>>>>> 
>>> 
>>> 
>>

Re: DISTINCT ON clause

Reply via email to