Re: DISTINCT ON clause

Julian Hyde Mon, 09 Apr 2018 10:01:44 -0700

ANY_VALUE sounds good. There could instead/in addition be SAME_VALUE (which is 
similar, but asserts that all values are the same).


> On Apr 4, 2018, at 3:33 PM, Aman Sinha <[email protected]> wrote:
> 
> Would the Calcite community be ok with having a new function called
> 'ANY_VALUE()' which is of type SqlAggFunction ?
> The usage of this would be with a GROUP BY,  similar to what Gian described
> for Druid but with the difference that the
> keyword any_value makes it explicit that there is no ordering requirement.
> If someone really needs ordering, then there is always the
> fallback of the ROW_NUMBER window function syntax.
> 
> -Aman
> 
> 
> On Wed, Mar 7, 2018 at 3:50 PM, Gian Merlino <[email protected]> wrote:
> 
>> In Druid I was hoping to eventually deal with this using FIRST and LAST
>> aggregators. So the original query would be:
>> 
>>  SELECT a, FIRST(b), FIRST(c) FROM T GROUP BY a
>> 
>> However, we have only implemented FIRST and LAST for numeric types, not for
>> strings or arrays yet. And we also haven't wired it into Druid SQL yet
>> (they are available in Druid's native query language only). Also in the
>> Druid the FIRST/LAST aggregators pick by first/last values by timestamp,
>> not by the ORDER BY clause of the query in general.
>> 
>> The idea would be you use them either when you really do want the
>> first/last value to be seen (by time), or when you just want _any_ value
>> and you don't care which one.
>> 
>> Gian
>> 
>> On Wed, Mar 7, 2018 at 6:16 PM, Julian Hyde <[email protected]> wrote:
>> 
>>> I always wished there was a variant of LIMIT that operated per key. E.g.
>>> return the youngest 2 employees in each department:
>>> 
>>> SELECT *
>>> FROM Amp
>>> ORDER BY deptno LIMIT 2, birthdate DESC
>>> 
>>> (This is not standard syntax.)
>>> 
>>> DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess with
>>> SQL evaluation order the way DISTINCT ON does.
>>> 
>>> Also, per-key limit is something that a sort-limit physical operator
>> could
>>> usefully and efficiently do.
>>> 
>>>> On Mar 4, 2018, at 8:09 PM, Aman Sinha <[email protected]> wrote:
>>>> 
>>>> So far, I haven't had much success.. other databases such as Oracle,
>> SQL
>>>> Server and DB2 don't seem to support this at least based on their
>>> published
>>>> SQL reference guides.
>>>> 
>>>> One way people work around this is to use ROW_NUMBER() window function
>> to
>>>> achieve similar results.
>>>> e.g
>>>>    SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER (PARTITION
>> BY
>>>> a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1;
>>>> 
>>>> Regarding the aggregate functions allowing arbitrary values in a
>> group, I
>>>> am not aware of databases that do that.  I am not sure how to verify
>> this
>>>> though.
>>>> 
>>>> -Aman
>>>> 
>>>> On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <[email protected]>
>>> wrote:
>>>> 
>>>>> I will try to find out more about this during the next few
>> days....what
>>> do
>>>>> Oracle, SQL Server, DB2 support.
>>>>> 
>>>>> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <[email protected]>
>> wrote:
>>>>> 
>>>>>> Can you do some research, and see if any other databases do anything
>>>>>> similar? Since Postgres isn’t standard, maybe we can improve upon it
>> a
>>> bit.
>>>>>> 
>>>>>> Also, do any databases have an aggregate function that takes an
>>> arbitrary
>>>>>> value in the group, or asserts that all values are the same? We would
>>> use
>>>>>> that when we translate the syntactic sugar to algebra.
>>>>>> 
>>>>>> Julian
>>>>>> 
>>>>>> 
>>>>>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <[email protected]>
>> wrote:
>>>>>>> 
>>>>>>> It is a useful functionality, especially since the SELECT list can
>>>>>> contain
>>>>>>> mix of primitive types and array or map type columns.
>>>>>>> I do see your point about the Postgres semantics of ORDER BY vs
>>> DISTINCT
>>>>>>> ON.
>>>>>>> I don't like the hidden semantics of MySQL either.
>>>>>>> Perhaps the compromise solution is to introduce functions such as
>>>>>>> FIRST_ROW() aggregation function (similar to lead/lag window
>>> functions)
>>>>>> and
>>>>>>> use GROUP BY.
>>>>>>> 
>>>>>>> -Aman
>>>>>>> 
>>>>>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <[email protected]>
>>> wrote:
>>>>>>> 
>>>>>>>> I can see that it would be useful. But there are a couple of things
>>>>>> about
>>>>>>>> it that are messy.
>>>>>>>> 
>>>>>>>> I think the syntax is a bit clunky because it uses parentheses; a
>>> query
>>>>>>>> would be difficult to read if people would like multiple columns,
>>>>>>>> expressions, and aliases.
>>>>>>>> 
>>>>>>>> I also think the semantics are messy. If you read
>>>>>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
>>>>>> SQL-DISTINCT <
>>>>>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
>>>>>> SQL-DISTINCT>
>>>>>>>> you will see that “DISTINCT ON” is evaluated after the ORDER BY
>>> clause,
>>>>>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause.
>> So
>>>>>> there
>>>>>>>> will be a bizarre interaction if DISTINCT ON is used with UNION and
>>>>>> ORDER
>>>>>>>> BY.
>>>>>>>> 
>>>>>>>> By the way, good ol’ MySQL doesn’t have this problem; you can just
>>>>>> write
>>>>>>>> 
>>>>>>>> SELECT a, b, c FROM t GROUP BY a
>>>>>>>> 
>>>>>>>> and it just picks the first value of b and c. This “feature” is one
>>> of
>>>>>> my
>>>>>>>> least favorite things about MySQL, so let’s not emulate it. :)
>>>>>>>> 
>>>>>>>> Julian
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <[email protected]>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> The DISTINCT 'ON' clause is not supported but I am trying to see
>> if
>>>>>> there
>>>>>>>>> are other people who have run into this.  One of the use cases I
>>> have
>>>>>>>>> intuitively maps to something like this:
>>>>>>>>> 
>>>>>>>>> SELECT DISTINCT *ON (a)*,  b, c  FROM T
>>>>>>>>> 
>>>>>>>>> Here  suppose 'a' is an INT  and b, c are some complex types such
>> as
>>>>>>>>> array.  In my example, the values of  b and c happen to be the
>> same
>>> in
>>>>>>>> all
>>>>>>>>> rows belonging to the same group of 'a', so I just want the first
>>> row.
>>>>>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a .   Other
>>>>>>>>> alternative is to create a new aggregate function that picks the
>>> first
>>>>>>>>> value but it makes the syntax verbose especially with large number
>>> of
>>>>>>>> such
>>>>>>>>> columns.
>>>>>>>>> 
>>>>>>>>> Incidentally, Postgres supports this [1]
>>>>>>>>> 
>>>>>>>>> [1]
>>>>>>>>> https://www.postgresql.org/docs/current/static/sql-
>>>>>>>> select.html#SQL-DISTINCT
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -Aman
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>>

Re: DISTINCT ON clause

Reply via email to