Re: Why are nested aggregations illegal? Best alternatives?

Julian Hyde Wed, 23 Feb 2022 00:45:43 -0800

Try ‘value ((‘ in place of ‘value (‘. 

Julian


> On Feb 21, 2022, at 9:33 AM, Gavin Ray <[email protected]> wrote:
> 
> I hadn't thought about the fact that ORM's probably have to solve this
> problem as well
> That is a great suggestion, I will try to investigate some of the popular
> ORM codebases and see if there are any tricks they are using.
> 
> I seem to maybe be getting a tiny bit closer by using subqueries like
> Julian suggested instead of operator calls
> But if I may ask what is probably a very stupid question:
> 
> What might the error message
> "parse failed: Query expression encountered in illegal context
> (state=,code=0)"
> 
> Mean in the below query?
> 
> The reason why I am confused is because the query runs if I remove the
> innermost subquery ("todos")
> But the innermost subquery is a direct copy-paste of the subquery above it,
> so I know it MUST be valid
> 
> As usual, thank you so much for your help/guidance Stamatis.
> 
> select
>  "g0"."id" "id",
>  "g0"."address" "address",
>  (
>    select json_arrayagg(json_object(
>      key 'id' value "g1"."id",
>      key 'todos' value (
>        select json_arrayagg(json_object(
>          key 'id' value "g2"."id",
>          key 'description' value "g2"."description",
>        ))
>        from (
>          select * from "todos"
>          where "g1"."id" = "user_id"
>          order by "id"
>        ) "g2"
>      )
>    ))
>    from (
>      select * from "users"
>      where "g0"."id" = "house_id"
>      order by "id"
>    ) "g1"
>  ) "users"
> from "houses" "g0"
> order by "g0"."id"
> 
>> On Mon, Feb 21, 2022 at 8:07 AM Stamatis Zampetakis <[email protected]>
>> wrote:
>> 
>> Hi Gavin,
>> 
>> A few more comments in case they help to get you a bit further on your
>> work.
>> 
>> The need to return the result as a single object is a common problem in
>> object relational mapping (ORM) frameworks/APIS (JPA, Datanucleus,
>> Hibernate, etc.). Apart from the suggestions so far maybe you could look
>> into these frameworks as well for more inspiration.
>> 
>> Moreover your approach of decomposing the query into individual parts is
>> commonly known as the N+1 problem [1].
>> 
>> Lastly, keep in mind that you can introduce custom UDF, UDAF functions if
>> you need more flexibility on reconstructing the final result.
>> 
>> Best,
>> Stamatis
>> 
>> [1]
>> 
>> https://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem-in-orm-object-relational-mapping
>> 
>>> On Sun, Feb 13, 2022 at 3:59 AM Gavin Ray <[email protected]> wrote:
>>> 
>>> Ah wait nevermind, got excited and spoke too soon. Looking at it more
>>> closely, that data isn't correct.
>>> At least it's in somewhat the right shape, ha!
>>> 
>>>> On Sat, Feb 12, 2022 at 9:57 PM Gavin Ray <[email protected]> wrote:
>>> 
>>>> After ~5 hours, I think I may have made some progress =)
>>>> 
>>>> I have this, which currently works. The problem is that the nested
>>> columns
>>>> don't have names on them.
>>>> Since I need to return a nested "Map<String, Object>", I have to figure
>>>> out how to convert this query into a form that gives column names.
>>>> 
>>>> But this is still great progress I think!
>>>> 
>>>> SELECT
>>>>    "todos".*,
>>>>    ARRAY(
>>>>        SELECT
>>>>            "users".*,
>>>>            ARRAY(
>>>>                SELECT
>>>>                    "todos".*
>>>>                FROM
>>>>                    "todos"
>>>>            ) AS "todos"
>>>>        FROM
>>>>            "users"
>>>>    ) AS "users"
>>>> FROM
>>>>    "todos"
>>>> WHERE
>>>>    "user_id" IN (
>>>>        SELECT
>>>>            "user_id"
>>>>        FROM
>>>>            "users"
>>>>        WHERE
>>>>            "house_id" IN (
>>>>                SELECT
>>>>                    "id"
>>>>                FROM
>>>>                    "houses"
>>>>            )
>>>>    );
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> +----+---------+------------------------+------------------------------------------------------------------------------+
>>>> | id | user_id |      description       |
>>>>                                            |
>>>> 
>>>> 
>>> 
>> +----+---------+------------------------+------------------------------------------------------------------------------+
>>>> | 1  | 1       | Take out the trash     | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 2  | 1       | Watch my favorite show | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 3  | 1       | Charge my phone        | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 4  | 2       | Cook dinner            | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 5  | 2       | Read a book            | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 6  | 2       | Organize office        | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 7  | 3       | Walk the dog           | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> | 8  | 3       | Feed the cat           | [{1, John, 1, [{1, 1, Take
>> out
>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>> 
>>>> 
>>> 
>> +----+---------+------------------------+------------------------------------------------------------------------------+
>>>> 
>>>> On Sat, Feb 12, 2022 at 4:13 PM Gavin Ray <[email protected]>
>> wrote:
>>>> 
>>>>> Nevermind, this is a standard term not something Calcite-specific it
>>>>> seems!
>>>>> 
>>>>> https://en.wikipedia.org/wiki/Correlated_subquery
>>>>> 
>>>>> On Sat, Feb 12, 2022 at 3:46 PM Gavin Ray <[email protected]>
>>> wrote:
>>>>> 
>>>>>> Forgive my ignorance/lack of experience
>>>>>> 
>>>>>> I am somewhat familiar with the ARRAY() function, but not sure I know
>>>>>> the term "correlated"
>>>>>> Searching the Calcite codebase for uses of "correlated" + "query", I
>>>>>> found:
>>>>>> 
>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/calcite/blob/1d4f1b394bfdba03c5538017e12ab2431b137ca9/core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java#L1603-L1612
>>>>>> 
>>>>>>  @Test void testCorrelatedSubQueryInJoin() {
>>>>>>    final String sql = "select *\n"
>>>>>>        + "from emp as e\n"
>>>>>>        + "join dept as d using (deptno)\n"
>>>>>>        + "where d.name = (\n"
>>>>>>        + "  select max(name)\n"
>>>>>>        + "  from dept as d2\n"
>>>>>>        + "  where d2.deptno = d.deptno)";
>>>>>>    sql(sql).withExpand(false).ok();
>>>>>>  }
>>>>>> 
>>>>>> But I also see this, which says it is "uncorrelated" but seems very
>>>>>> similar?
>>>>>> 
>>>>>>  @Test void testInUncorrelatedSubQuery() {
>>>>>>    final String sql = "select empno from emp where deptno in"
>>>>>>        + " (select deptno from dept)";
>>>>>>    sql(sql).ok();
>>>>>>  }
>>>>>> 
>>>>>> I wouldn't blame you for not answering such a basic question -- but
>>> what
>>>>>> exactly does "correlation" mean here?
>>>>>> 
>>>>>> Thanks, as usual Julian
>>>>>> 
>>>>>> 
>>>>>> On Sat, Feb 12, 2022 at 3:08 PM Julian Hyde <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Correlated ARRAY sub-query?
>>>>>>> 
>>>>>>>> On Feb 12, 2022, at 10:40 AM, Gavin Ray <[email protected]>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Apologies for the delay in replying
>>>>>>>> 
>>>>>>>> This makes things clear and seems obvious now that you point it
>> out.
>>>>>>>> Thank you, Justin and Julian =)
>>>>>>>> 
>>>>>>>> Let me ask another question (if I may) that I am struggling to
>>> phrase
>>>>>>>> easily.
>>>>>>>> 
>>>>>>>> So with GraphQL, you might have a query like:
>>>>>>>> - "Get houses"
>>>>>>>> - "For each house get the user that lives in the house
>>>>>>>> - "And for each user get their list of todos"
>>>>>>>> 
>>>>>>>> The result has to come back such that it's a single object for
>> each
>>>>>>> row
>>>>>>>> ==============
>>>>>>>> {
>>>>>>>> houses: [{
>>>>>>>>   address: "123 Main Street",
>>>>>>>>   users: [{
>>>>>>>>     name: "Joe",
>>>>>>>>     todos: [{
>>>>>>>>       description: "Take out trash"
>>>>>>>>     }]
>>>>>>>>   }]
>>>>>>>> }
>>>>>>>> 
>>>>>>>> From a SQL perspective, the logical equivalent would be something
>>>>>>> like:
>>>>>>>> ==============
>>>>>>>> SELECT
>>>>>>>>   house.address,
>>>>>>>>   (somehow nest users + double-nest todos under user)
>>>>>>>> FROM
>>>>>>>>   house
>>>>>>>> JOIN
>>>>>>>>   user ON user.house_id = house.id
>>>>>>>>   todos ON todos.user_id = user.id
>>>>>>>> WHERE
>>>>>>>>   house.id = 1
>>>>>>>> 
>>>>>>>> I'm not familiar enough with SQL to have figured out a way to make
>>>>>>> this
>>>>>>>> kind of
>>>>>>>> query using operators that are supported across most of the DB's
>>>>>>> Calcite has
>>>>>>>> adapters for.
>>>>>>>> 
>>>>>>>> Currently what I have done instead, on a tip from Gopalakrishna
>>> Holla
>>>>>>> from
>>>>>>>> LinkedIn Coral team who has built GraphQL-on-Calcite, was to break
>>> up
>>>>>>> the
>>>>>>>> query
>>>>>>>> into individual parts and then do the join in-memory:
>>>>>>>> 
>>>>>>>> SELECT ... FROM users;
>>>>>>>> SELECT ... FROM todos WHERE todos.user_id IN (ResultSet from prev
>>>>>>> response);
>>>>>>>> 
>>>>>>>> However, the way I am doing this seems like it's probably very
>>>>>>> inefficient.
>>>>>>>> Because I do a series of nested loops to add the key to each
>> object
>>>>>>> in the
>>>>>>>> parent ResultSet row:
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>> 
>> https://github.com/GavinRay97/GraphQLCalcite/blob/648e0ac4f6810a3c360d13a03e6597c33406de4b/src/main/kotlin/TableDataFetcher.kt#L135-L153
>>>>>>>> 
>>>>>>>> Is there some better way of doing this?
>>>>>>>> I would be eternally grateful for any advice.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Feb 10, 2022 at 3:27 PM Julian Hyde <
>> [email protected]
>>>> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Yes, if you want to do multiple layers of aggregation, use CTEs
>>>>>>> (WITH) or
>>>>>>>>> nested sub-queries. For example, the following is I believe valid
>>>>>>> standard
>>>>>>>>> SQL, and actually computes something useful:
>>>>>>>>> 
>>>>>>>>> WITH q1 AS
>>>>>>>>>  (SELECT deptno, job, AVG(sal) AS avg_sal
>>>>>>>>>   FROM emp
>>>>>>>>>   GROUP BY deptno, job)
>>>>>>>>> WITH q2 AS
>>>>>>>>>  (SELECT deptno, AVG(avg_sal) AS avg_avg_sal
>>>>>>>>>   FROM q1
>>>>>>>>>   GROUP BY deptno)
>>>>>>>>> SELECT AVG(avg_avg_sal)
>>>>>>>>> FROM q2
>>>>>>>>> GROUP BY ()
>>>>>>>>> 
>>>>>>>>> (You can omit the “GROUP BY ()” line, but I think it makes things
>>>>>>> clearer.)
>>>>>>>>> 
>>>>>>>>> Julian
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Feb 10, 2022, at 12:17 PM, Justin Swanhart <
>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> I wish you could unsend emails :)  Answering my own question,
>> no,
>>>>>>> because
>>>>>>>>>> that would return three rows with the average :D
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 10, 2022 at 3:16 PM Justin Swanhart <
>>>>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Just out of curiosity, is the second level aggregation using
>> AVG
>>>>>>> in a
>>>>>>>>>>> window context?  It the frame is the whole table and it
>>> aggregates
>>>>>>> over
>>>>>>>>> it?
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 10, 2022 at 3:12 PM Justin Swanhart <
>>>>>>> [email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> That is really neat about Oracle.
>>>>>>>>>>>> 
>>>>>>>>>>>> The alternative in general is to use a subquery:
>>>>>>>>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
>>>>>>>>>>>> becomes
>>>>>>>>>>>> select avg(the_avg)
>>>>>>>>>>>> from (select avg(sal) from emp group b deptno) an_alias;
>>>>>>>>>>>> 
>>>>>>>>>>>> or
>>>>>>>>>>>> 
>>>>>>>>>>>> with the_cte as (select avg(sal) x from emp group by deptno)
>>>>>>>>>>>> select avg(x) from the_cte;
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Feb 10, 2022 at 3:03 PM Julian Hyde <
>>>>>>> [email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Some databases, e.g. Oracle, allow TWO levels of nesting:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> SELECT avg(sal) FROM emp GROUP BY deptno;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> AVG(SAL)
>>>>>>>>>>>>> ========
>>>>>>>>>>>>> 1,566.67
>>>>>>>>>>>>> 2,175.00
>>>>>>>>>>>>> 2,916.65
>>>>>>>>>>>>> 
>>>>>>>>>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> AVG(SUM(SAL))
>>>>>>>>>>>>> =============
>>>>>>>>>>>>>      9,675
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The first level aggregates by department (returning 3
>> records),
>>>>>>> and
>>>>>>>>> the
>>>>>>>>>>>>> second level computes the grand total (returning 1 record).
>> But
>>>>>>> that
>>>>>>>>> is an
>>>>>>>>>>>>> exceptional case.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Generally, any expression in the SELECT or HAVING clause of
>> an
>>>>>>>>> aggregate
>>>>>>>>>>>>> query is either ‘before’ or ‘after’ aggregation. Consider
>>>>>>>>>>>>> 
>>>>>>>>>>>>> SELECT t.x + 1 AS a, 2 + SUM(t.y + 3) AS b
>>>>>>>>>>>>> FROM t
>>>>>>>>>>>>> GROUP BY t.x
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The expressions “t.y” and “t.y + 3” occur before aggregation;
>>>>>>> “t.x”,
>>>>>>>>>>>>> “t.x + 1”, “SUM(t.y + 3)” and “2 + SUM(t.y + 3)” occur after
>>>>>>>>> aggregation.
>>>>>>>>>>>>> SQL semantics rely heavily on this stratification. Allowing
>> an
>>>>>>> extra
>>>>>>>>> level
>>>>>>>>>>>>> of aggregation would mess it all up.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Julian
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Feb 10, 2022, at 9:45 AM, Justin Swanhart <
>>>>>>> [email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a SQL limitation.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> mysql> select sum(1);
>>>>>>>>>>>>>> +--------+
>>>>>>>>>>>>>> | sum(1) |
>>>>>>>>>>>>>> +--------+
>>>>>>>>>>>>>> |      1 |
>>>>>>>>>>>>>> +--------+
>>>>>>>>>>>>>> 1 row in set (0.00 sec)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> mysql> select sum(sum(1));
>>>>>>>>>>>>>> ERROR 1111 (HY000): Invalid use of group function
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Feb 10, 2022 at 12:39 PM Gavin Ray <
>>>>>>> [email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Went to test this query out and found that it can't be
>>>>>>> performed:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> SELECT
>>>>>>>>>>>>>>> JSON_OBJECT(
>>>>>>>>>>>>>>>     KEY 'users'
>>>>>>>>>>>>>>>     VALUE JSON_ARRAYAGG(
>>>>>>>>>>>>>>>         JSON_OBJECT(
>>>>>>>>>>>>>>>             KEY 'name' VALUE "users"."name",
>>>>>>>>>>>>>>>             KEY 'todos' VALUE JSON_ARRAYAGG(
>>>>>>>>>>>>>>>                 JSON_OBJECT(
>>>>>>>>>>>>>>>                     KEY 'description' VALUE
>>>>>>> "todos"."description"
>>>>>>>>>>>>>>>                 )
>>>>>>>>>>>>>>>             )
>>>>>>>>>>>>>>>         )
>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>> FROM
>>>>>>>>>>>>>>> "users"
>>>>>>>>>>>>>>> LEFT OUTER JOIN
>>>>>>>>>>>>>>> "todos" ON "users"."id" = "todos"."user_id";
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Checking the source, seems this is a blanket policy, not a
>>>>>>>>>>>>>>> datasource-specific thing.
>>>>>>>>>>>>>>> From a functional perspective, it doesn't feel like it's
>> much
>>>>>>>>>>>>> different
>>>>>>>>>>>>>>> from JOINs
>>>>>>>>>>>>>>> But I don't understand relational theory or DB
>> functionality
>>>>>>> in the
>>>>>>>>>>>>> least,
>>>>>>>>>>>>>>> so I'm not fit to judge.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Just curious why Calcite doesn't allow this
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>> 
>>

Re: Why are nested aggregations illegal? Best alternatives?

Reply via email to