ANY_VALUE sounds good. There could instead/in addition be SAME_VALUE (which is similar, but asserts that all values are the same).
> On Apr 4, 2018, at 3:33 PM, Aman Sinha <amansi...@apache.org> wrote: > > Would the Calcite community be ok with having a new function called > 'ANY_VALUE()' which is of type SqlAggFunction ? > The usage of this would be with a GROUP BY, similar to what Gian described > for Druid but with the difference that the > keyword any_value makes it explicit that there is no ordering requirement. > If someone really needs ordering, then there is always the > fallback of the ROW_NUMBER window function syntax. > > -Aman > > > On Wed, Mar 7, 2018 at 3:50 PM, Gian Merlino <g...@imply.io> wrote: > >> In Druid I was hoping to eventually deal with this using FIRST and LAST >> aggregators. So the original query would be: >> >> SELECT a, FIRST(b), FIRST(c) FROM T GROUP BY a >> >> However, we have only implemented FIRST and LAST for numeric types, not for >> strings or arrays yet. And we also haven't wired it into Druid SQL yet >> (they are available in Druid's native query language only). Also in the >> Druid the FIRST/LAST aggregators pick by first/last values by timestamp, >> not by the ORDER BY clause of the query in general. >> >> The idea would be you use them either when you really do want the >> first/last value to be seen (by time), or when you just want _any_ value >> and you don't care which one. >> >> Gian >> >> On Wed, Mar 7, 2018 at 6:16 PM, Julian Hyde <jh...@apache.org> wrote: >> >>> I always wished there was a variant of LIMIT that operated per key. E.g. >>> return the youngest 2 employees in each department: >>> >>> SELECT * >>> FROM Amp >>> ORDER BY deptno LIMIT 2, birthdate DESC >>> >>> (This is not standard syntax.) >>> >>> DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess with >>> SQL evaluation order the way DISTINCT ON does. >>> >>> Also, per-key limit is something that a sort-limit physical operator >> could >>> usefully and efficiently do. >>> >>>> On Mar 4, 2018, at 8:09 PM, Aman Sinha <amansi...@apache.org> wrote: >>>> >>>> So far, I haven't had much success.. other databases such as Oracle, >> SQL >>>> Server and DB2 don't seem to support this at least based on their >>> published >>>> SQL reference guides. >>>> >>>> One way people work around this is to use ROW_NUMBER() window function >> to >>>> achieve similar results. >>>> e.g >>>> SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER (PARTITION >> BY >>>> a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1; >>>> >>>> Regarding the aggregate functions allowing arbitrary values in a >> group, I >>>> am not aware of databases that do that. I am not sure how to verify >> this >>>> though. >>>> >>>> -Aman >>>> >>>> On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <amansi...@apache.org> >>> wrote: >>>> >>>>> I will try to find out more about this during the next few >> days....what >>> do >>>>> Oracle, SQL Server, DB2 support. >>>>> >>>>> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <jh...@apache.org> >> wrote: >>>>> >>>>>> Can you do some research, and see if any other databases do anything >>>>>> similar? Since Postgres isn’t standard, maybe we can improve upon it >> a >>> bit. >>>>>> >>>>>> Also, do any databases have an aggregate function that takes an >>> arbitrary >>>>>> value in the group, or asserts that all values are the same? We would >>> use >>>>>> that when we translate the syntactic sugar to algebra. >>>>>> >>>>>> Julian >>>>>> >>>>>> >>>>>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <amansi...@apache.org> >> wrote: >>>>>>> >>>>>>> It is a useful functionality, especially since the SELECT list can >>>>>> contain >>>>>>> mix of primitive types and array or map type columns. >>>>>>> I do see your point about the Postgres semantics of ORDER BY vs >>> DISTINCT >>>>>>> ON. >>>>>>> I don't like the hidden semantics of MySQL either. >>>>>>> Perhaps the compromise solution is to introduce functions such as >>>>>>> FIRST_ROW() aggregation function (similar to lead/lag window >>> functions) >>>>>> and >>>>>>> use GROUP BY. >>>>>>> >>>>>>> -Aman >>>>>>> >>>>>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <jh...@apache.org> >>> wrote: >>>>>>> >>>>>>>> I can see that it would be useful. But there are a couple of things >>>>>> about >>>>>>>> it that are messy. >>>>>>>> >>>>>>>> I think the syntax is a bit clunky because it uses parentheses; a >>> query >>>>>>>> would be difficult to read if people would like multiple columns, >>>>>>>> expressions, and aliases. >>>>>>>> >>>>>>>> I also think the semantics are messy. If you read >>>>>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html# >>>>>> SQL-DISTINCT < >>>>>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html# >>>>>> SQL-DISTINCT> >>>>>>>> you will see that “DISTINCT ON” is evaluated after the ORDER BY >>> clause, >>>>>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause. >> So >>>>>> there >>>>>>>> will be a bizarre interaction if DISTINCT ON is used with UNION and >>>>>> ORDER >>>>>>>> BY. >>>>>>>> >>>>>>>> By the way, good ol’ MySQL doesn’t have this problem; you can just >>>>>> write >>>>>>>> >>>>>>>> SELECT a, b, c FROM t GROUP BY a >>>>>>>> >>>>>>>> and it just picks the first value of b and c. This “feature” is one >>> of >>>>>> my >>>>>>>> least favorite things about MySQL, so let’s not emulate it. :) >>>>>>>> >>>>>>>> Julian >>>>>>>> >>>>>>>> >>>>>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <amansi...@apache.org> >>>>>> wrote: >>>>>>>>> >>>>>>>>> The DISTINCT 'ON' clause is not supported but I am trying to see >> if >>>>>> there >>>>>>>>> are other people who have run into this. One of the use cases I >>> have >>>>>>>>> intuitively maps to something like this: >>>>>>>>> >>>>>>>>> SELECT DISTINCT *ON (a)*, b, c FROM T >>>>>>>>> >>>>>>>>> Here suppose 'a' is an INT and b, c are some complex types such >> as >>>>>>>>> array. In my example, the values of b and c happen to be the >> same >>> in >>>>>>>> all >>>>>>>>> rows belonging to the same group of 'a', so I just want the first >>> row. >>>>>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a . Other >>>>>>>>> alternative is to create a new aggregate function that picks the >>> first >>>>>>>>> value but it makes the syntax verbose especially with large number >>> of >>>>>>>> such >>>>>>>>> columns. >>>>>>>>> >>>>>>>>> Incidentally, Postgres supports this [1] >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> https://www.postgresql.org/docs/current/static/sql- >>>>>>>> select.html#SQL-DISTINCT >>>>>>>>> >>>>>>>>> >>>>>>>>> -Aman >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>> >>> >>> >>