Re: Why are nested aggregations illegal? Best alternatives?

Gavin Ray Sat, 12 Feb 2022 18:58:01 -0800

After ~5 hours, I think I may have made some progress =)

I have this, which currently works. The problem is that the nested columns
don't have names on them.
Since I need to return a nested "Map<String, Object>", I have to figure out
how to convert this query into a form that gives column names.


But this is still great progress I think!

SELECT
    "todos".*,
    ARRAY(
        SELECT
            "users".*,
            ARRAY(
                SELECT
                    "todos".*
                FROM
                    "todos"
            ) AS "todos"
        FROM
            "users"
    ) AS "users"
FROM
    "todos"
WHERE
    "user_id" IN (
        SELECT
            "user_id"
        FROM
            "users"
        WHERE
            "house_id" IN (
                SELECT
                    "id"
                FROM
                    "houses"
            )
    );


+----+---------+------------------------+------------------------------------------------------------------------------+
| id | user_id |      description       |
                                            |
+----+---------+------------------------+------------------------------------------------------------------------------+
| 1  | 1       | Take out the trash     | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 2  | 1       | Watch my favorite show | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 3  | 1       | Charge my phone        | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 4  | 2       | Cook dinner            | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 5  | 2       | Read a book            | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 6  | 2       | Organize office        | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 7  | 3       | Walk the dog           | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
| 8  | 3       | Feed the cat           | [{1, John, 1, [{1, 1, Take out
the trash}, {2, 1, Watch my favorite show}, { |
+----+---------+------------------------+------------------------------------------------------------------------------+

On Sat, Feb 12, 2022 at 4:13 PM Gavin Ray <[email protected]> wrote:

> Nevermind, this is a standard term not something Calcite-specific it seems!
>
> https://en.wikipedia.org/wiki/Correlated_subquery
>
> On Sat, Feb 12, 2022 at 3:46 PM Gavin Ray <[email protected]> wrote:
>
>> Forgive my ignorance/lack of experience
>>
>> I am somewhat familiar with the ARRAY() function, but not sure I know the
>> term "correlated"
>> Searching the Calcite codebase for uses of "correlated" + "query", I
>> found:
>>
>>
>> https://github.com/apache/calcite/blob/1d4f1b394bfdba03c5538017e12ab2431b137ca9/core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java#L1603-L1612
>>
>>   @Test void testCorrelatedSubQueryInJoin() {
>>     final String sql = "select *\n"
>>         + "from emp as e\n"
>>         + "join dept as d using (deptno)\n"
>>         + "where d.name = (\n"
>>         + "  select max(name)\n"
>>         + "  from dept as d2\n"
>>         + "  where d2.deptno = d.deptno)";
>>     sql(sql).withExpand(false).ok();
>>   }
>>
>> But I also see this, which says it is "uncorrelated" but seems very
>> similar?
>>
>>   @Test void testInUncorrelatedSubQuery() {
>>     final String sql = "select empno from emp where deptno in"
>>         + " (select deptno from dept)";
>>     sql(sql).ok();
>>   }
>>
>> I wouldn't blame you for not answering such a basic question -- but what
>> exactly does "correlation" mean here?
>>
>> Thanks, as usual Julian
>>
>>
>> On Sat, Feb 12, 2022 at 3:08 PM Julian Hyde <[email protected]>
>> wrote:
>>
>>> Correlated ARRAY sub-query?
>>>
>>> > On Feb 12, 2022, at 10:40 AM, Gavin Ray <[email protected]> wrote:
>>> >
>>> > Apologies for the delay in replying
>>> >
>>> > This makes things clear and seems obvious now that you point it out.
>>> > Thank you, Justin and Julian =)
>>> >
>>> > Let me ask another question (if I may) that I am struggling to phrase
>>> > easily.
>>> >
>>> > So with GraphQL, you might have a query like:
>>> > - "Get houses"
>>> > - "For each house get the user that lives in the house
>>> > - "And for each user get their list of todos"
>>> >
>>> > The result has to come back such that it's a single object for each row
>>> > ==============
>>> > {
>>> >  houses: [{
>>> >    address: "123 Main Street",
>>> >    users: [{
>>> >      name: "Joe",
>>> >      todos: [{
>>> >        description: "Take out trash"
>>> >      }]
>>> >    }]
>>> > }
>>> >
>>> > From a SQL perspective, the logical equivalent would be something like:
>>> > ==============
>>> > SELECT
>>> >    house.address,
>>> >    (somehow nest users + double-nest todos under user)
>>> > FROM
>>> >    house
>>> > JOIN
>>> >    user ON user.house_id = house.id
>>> >    todos ON todos.user_id = user.id
>>> > WHERE
>>> >    house.id = 1
>>> >
>>> > I'm not familiar enough with SQL to have figured out a way to make this
>>> > kind of
>>> > query using operators that are supported across most of the DB's
>>> Calcite has
>>> > adapters for.
>>> >
>>> > Currently what I have done instead, on a tip from Gopalakrishna Holla
>>> from
>>> > LinkedIn Coral team who has built GraphQL-on-Calcite, was to break up
>>> the
>>> > query
>>> > into individual parts and then do the join in-memory:
>>> >
>>> > SELECT ... FROM users;
>>> > SELECT ... FROM todos WHERE todos.user_id IN (ResultSet from prev
>>> response);
>>> >
>>> > However, the way I am doing this seems like it's probably very
>>> inefficient.
>>> > Because I do a series of nested loops to add the key to each object in
>>> the
>>> > parent ResultSet row:
>>> >
>>> >
>>> https://github.com/GavinRay97/GraphQLCalcite/blob/648e0ac4f6810a3c360d13a03e6597c33406de4b/src/main/kotlin/TableDataFetcher.kt#L135-L153
>>> >
>>> > Is there some better way of doing this?
>>> > I would be eternally grateful for any advice.
>>> >
>>> >
>>> > On Thu, Feb 10, 2022 at 3:27 PM Julian Hyde <[email protected]>
>>> wrote:
>>> >
>>> >> Yes, if you want to do multiple layers of aggregation, use CTEs
>>> (WITH) or
>>> >> nested sub-queries. For example, the following is I believe valid
>>> standard
>>> >> SQL, and actually computes something useful:
>>> >>
>>> >>  WITH q1 AS
>>> >>   (SELECT deptno, job, AVG(sal) AS avg_sal
>>> >>    FROM emp
>>> >>    GROUP BY deptno, job)
>>> >>  WITH q2 AS
>>> >>   (SELECT deptno, AVG(avg_sal) AS avg_avg_sal
>>> >>    FROM q1
>>> >>    GROUP BY deptno)
>>> >>  SELECT AVG(avg_avg_sal)
>>> >>  FROM q2
>>> >>  GROUP BY ()
>>> >>
>>> >> (You can omit the “GROUP BY ()” line, but I think it makes things
>>> clearer.)
>>> >>
>>> >> Julian
>>> >>
>>> >>
>>> >>
>>> >>> On Feb 10, 2022, at 12:17 PM, Justin Swanhart <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> I wish you could unsend emails :)  Answering my own question, no,
>>> because
>>> >>> that would return three rows with the average :D
>>> >>>
>>> >>> On Thu, Feb 10, 2022 at 3:16 PM Justin Swanhart <[email protected]
>>> >
>>> >> wrote:
>>> >>>
>>> >>>> Just out of curiosity, is the second level aggregation using AVG in
>>> a
>>> >>>> window context?  It the frame is the whole table and it aggregates
>>> over
>>> >> it?
>>> >>>>
>>> >>>> On Thu, Feb 10, 2022 at 3:12 PM Justin Swanhart <
>>> [email protected]>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> That is really neat about Oracle.
>>> >>>>>
>>> >>>>> The alternative in general is to use a subquery:
>>> >>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
>>> >>>>> becomes
>>> >>>>> select avg(the_avg)
>>> >>>>> from (select avg(sal) from emp group b deptno) an_alias;
>>> >>>>>
>>> >>>>> or
>>> >>>>>
>>> >>>>> with the_cte as (select avg(sal) x from emp group by deptno)
>>> >>>>> select avg(x) from the_cte;
>>> >>>>>
>>> >>>>> On Thu, Feb 10, 2022 at 3:03 PM Julian Hyde <
>>> [email protected]>
>>> >>>>> wrote:
>>> >>>>>
>>> >>>>>> Some databases, e.g. Oracle, allow TWO levels of nesting:
>>> >>>>>>
>>> >>>>>> SELECT avg(sal) FROM emp GROUP BY deptno;
>>> >>>>>>
>>> >>>>>> AVG(SAL)
>>> >>>>>> ========
>>> >>>>>> 1,566.67
>>> >>>>>> 2,175.00
>>> >>>>>> 2,916.65
>>> >>>>>>
>>> >>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
>>> >>>>>>
>>> >>>>>> AVG(SUM(SAL))
>>> >>>>>> =============
>>> >>>>>>       9,675
>>> >>>>>>
>>> >>>>>> The first level aggregates by department (returning 3 records),
>>> and
>>> >> the
>>> >>>>>> second level computes the grand total (returning 1 record). But
>>> that
>>> >> is an
>>> >>>>>> exceptional case.
>>> >>>>>>
>>> >>>>>> Generally, any expression in the SELECT or HAVING clause of an
>>> >> aggregate
>>> >>>>>> query is either ‘before’ or ‘after’ aggregation. Consider
>>> >>>>>>
>>> >>>>>> SELECT t.x + 1 AS a, 2 + SUM(t.y + 3) AS b
>>> >>>>>> FROM t
>>> >>>>>> GROUP BY t.x
>>> >>>>>>
>>> >>>>>> The expressions “t.y” and “t.y + 3” occur before aggregation;
>>> “t.x”,
>>> >>>>>> “t.x + 1”, “SUM(t.y + 3)” and “2 + SUM(t.y + 3)” occur after
>>> >> aggregation.
>>> >>>>>> SQL semantics rely heavily on this stratification. Allowing an
>>> extra
>>> >> level
>>> >>>>>> of aggregation would mess it all up.
>>> >>>>>>
>>> >>>>>> Julian
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>> On Feb 10, 2022, at 9:45 AM, Justin Swanhart <
>>> [email protected]>
>>> >>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>> This is a SQL limitation.
>>> >>>>>>>
>>> >>>>>>> mysql> select sum(1);
>>> >>>>>>> +--------+
>>> >>>>>>> | sum(1) |
>>> >>>>>>> +--------+
>>> >>>>>>> |      1 |
>>> >>>>>>> +--------+
>>> >>>>>>> 1 row in set (0.00 sec)
>>> >>>>>>>
>>> >>>>>>> mysql> select sum(sum(1));
>>> >>>>>>> ERROR 1111 (HY000): Invalid use of group function
>>> >>>>>>>
>>> >>>>>>> On Thu, Feb 10, 2022 at 12:39 PM Gavin Ray <
>>> [email protected]>
>>> >>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> Went to test this query out and found that it can't be
>>> performed:
>>> >>>>>>>>
>>> >>>>>>>> SELECT
>>> >>>>>>>>  JSON_OBJECT(
>>> >>>>>>>>      KEY 'users'
>>> >>>>>>>>      VALUE JSON_ARRAYAGG(
>>> >>>>>>>>          JSON_OBJECT(
>>> >>>>>>>>              KEY 'name' VALUE "users"."name",
>>> >>>>>>>>              KEY 'todos' VALUE JSON_ARRAYAGG(
>>> >>>>>>>>                  JSON_OBJECT(
>>> >>>>>>>>                      KEY 'description' VALUE
>>> "todos"."description"
>>> >>>>>>>>                  )
>>> >>>>>>>>              )
>>> >>>>>>>>          )
>>> >>>>>>>>      )
>>> >>>>>>>>  )
>>> >>>>>>>> FROM
>>> >>>>>>>>  "users"
>>> >>>>>>>> LEFT OUTER JOIN
>>> >>>>>>>>  "todos" ON "users"."id" = "todos"."user_id";
>>> >>>>>>>>
>>> >>>>>>>> Checking the source, seems this is a blanket policy, not a
>>> >>>>>>>> datasource-specific thing.
>>> >>>>>>>> From a functional perspective, it doesn't feel like it's much
>>> >>>>>> different
>>> >>>>>>>> from JOINs
>>> >>>>>>>> But I don't understand relational theory or DB functionality in
>>> the
>>> >>>>>> least,
>>> >>>>>>>> so I'm not fit to judge.
>>> >>>>>>>>
>>> >>>>>>>> Just curious why Calcite doesn't allow this
>>> >>>>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>
>>> >>
>>>
>>>

Re: Why are nested aggregations illegal? Best alternatives?

Reply via email to