Re: Why are nested aggregations illegal? Best alternatives?

Gavin Ray Mon, 21 Feb 2022 09:33:02 -0800

I hadn't thought about the fact that ORM's probably have to solve this
problem as well
That is a great suggestion, I will try to investigate some of the popular
ORM codebases and see if there are any tricks they are using.


I seem to maybe be getting a tiny bit closer by using subqueries like
Julian suggested instead of operator calls
But if I may ask what is probably a very stupid question:

What might the error message
"parse failed: Query expression encountered in illegal context
(state=,code=0)"

Mean in the below query?

The reason why I am confused is because the query runs if I remove the
innermost subquery ("todos")
But the innermost subquery is a direct copy-paste of the subquery above it,
so I know it MUST be valid

As usual, thank you so much for your help/guidance Stamatis.

select
  "g0"."id" "id",
  "g0"."address" "address",
  (
    select json_arrayagg(json_object(
      key 'id' value "g1"."id",
      key 'todos' value (
        select json_arrayagg(json_object(
          key 'id' value "g2"."id",
          key 'description' value "g2"."description",
        ))
        from (
          select * from "todos"
          where "g1"."id" = "user_id"
          order by "id"
        ) "g2"
      )
    ))
    from (
      select * from "users"
      where "g0"."id" = "house_id"
      order by "id"
    ) "g1"
  ) "users"
from "houses" "g0"
order by "g0"."id"

On Mon, Feb 21, 2022 at 8:07 AM Stamatis Zampetakis <[email protected]>
wrote:

> Hi Gavin,
>
> A few more comments in case they help to get you a bit further on your
> work.
>
> The need to return the result as a single object is a common problem in
> object relational mapping (ORM) frameworks/APIS (JPA, Datanucleus,
> Hibernate, etc.). Apart from the suggestions so far maybe you could look
> into these frameworks as well for more inspiration.
>
> Moreover your approach of decomposing the query into individual parts is
> commonly known as the N+1 problem [1].
>
> Lastly, keep in mind that you can introduce custom UDF, UDAF functions if
> you need more flexibility on reconstructing the final result.
>
> Best,
> Stamatis
>
> [1]
>
> https://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem-in-orm-object-relational-mapping
>
> On Sun, Feb 13, 2022 at 3:59 AM Gavin Ray <[email protected]> wrote:
>
> > Ah wait nevermind, got excited and spoke too soon. Looking at it more
> > closely, that data isn't correct.
> > At least it's in somewhat the right shape, ha!
> >
> > On Sat, Feb 12, 2022 at 9:57 PM Gavin Ray <[email protected]> wrote:
> >
> > > After ~5 hours, I think I may have made some progress =)
> > >
> > > I have this, which currently works. The problem is that the nested
> > columns
> > > don't have names on them.
> > > Since I need to return a nested "Map<String, Object>", I have to figure
> > > out how to convert this query into a form that gives column names.
> > >
> > > But this is still great progress I think!
> > >
> > > SELECT
> > >     "todos".*,
> > >     ARRAY(
> > >         SELECT
> > >             "users".*,
> > >             ARRAY(
> > >                 SELECT
> > >                     "todos".*
> > >                 FROM
> > >                     "todos"
> > >             ) AS "todos"
> > >         FROM
> > >             "users"
> > >     ) AS "users"
> > > FROM
> > >     "todos"
> > > WHERE
> > >     "user_id" IN (
> > >         SELECT
> > >             "user_id"
> > >         FROM
> > >             "users"
> > >         WHERE
> > >             "house_id" IN (
> > >                 SELECT
> > >                     "id"
> > >                 FROM
> > >                     "houses"
> > >             )
> > >     );
> > >
> > >
> > >
> > >
> >
> +----+---------+------------------------+------------------------------------------------------------------------------+
> > > | id | user_id |      description       |
> > >                                             |
> > >
> > >
> >
> +----+---------+------------------------+------------------------------------------------------------------------------+
> > > | 1  | 1       | Take out the trash     | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 2  | 1       | Watch my favorite show | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 3  | 1       | Charge my phone        | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 4  | 2       | Cook dinner            | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 5  | 2       | Read a book            | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 6  | 2       | Organize office        | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 7  | 3       | Walk the dog           | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > > | 8  | 3       | Feed the cat           | [{1, John, 1, [{1, 1, Take
> out
> > > the trash}, {2, 1, Watch my favorite show}, { |
> > >
> > >
> >
> +----+---------+------------------------+------------------------------------------------------------------------------+
> > >
> > > On Sat, Feb 12, 2022 at 4:13 PM Gavin Ray <[email protected]>
> wrote:
> > >
> > >> Nevermind, this is a standard term not something Calcite-specific it
> > >> seems!
> > >>
> > >> https://en.wikipedia.org/wiki/Correlated_subquery
> > >>
> > >> On Sat, Feb 12, 2022 at 3:46 PM Gavin Ray <[email protected]>
> > wrote:
> > >>
> > >>> Forgive my ignorance/lack of experience
> > >>>
> > >>> I am somewhat familiar with the ARRAY() function, but not sure I know
> > >>> the term "correlated"
> > >>> Searching the Calcite codebase for uses of "correlated" + "query", I
> > >>> found:
> > >>>
> > >>>
> > >>>
> >
> https://github.com/apache/calcite/blob/1d4f1b394bfdba03c5538017e12ab2431b137ca9/core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java#L1603-L1612
> > >>>
> > >>>   @Test void testCorrelatedSubQueryInJoin() {
> > >>>     final String sql = "select *\n"
> > >>>         + "from emp as e\n"
> > >>>         + "join dept as d using (deptno)\n"
> > >>>         + "where d.name = (\n"
> > >>>         + "  select max(name)\n"
> > >>>         + "  from dept as d2\n"
> > >>>         + "  where d2.deptno = d.deptno)";
> > >>>     sql(sql).withExpand(false).ok();
> > >>>   }
> > >>>
> > >>> But I also see this, which says it is "uncorrelated" but seems very
> > >>> similar?
> > >>>
> > >>>   @Test void testInUncorrelatedSubQuery() {
> > >>>     final String sql = "select empno from emp where deptno in"
> > >>>         + " (select deptno from dept)";
> > >>>     sql(sql).ok();
> > >>>   }
> > >>>
> > >>> I wouldn't blame you for not answering such a basic question -- but
> > what
> > >>> exactly does "correlation" mean here?
> > >>>
> > >>> Thanks, as usual Julian
> > >>>
> > >>>
> > >>> On Sat, Feb 12, 2022 at 3:08 PM Julian Hyde <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Correlated ARRAY sub-query?
> > >>>>
> > >>>> > On Feb 12, 2022, at 10:40 AM, Gavin Ray <[email protected]>
> > >>>> wrote:
> > >>>> >
> > >>>> > Apologies for the delay in replying
> > >>>> >
> > >>>> > This makes things clear and seems obvious now that you point it
> out.
> > >>>> > Thank you, Justin and Julian =)
> > >>>> >
> > >>>> > Let me ask another question (if I may) that I am struggling to
> > phrase
> > >>>> > easily.
> > >>>> >
> > >>>> > So with GraphQL, you might have a query like:
> > >>>> > - "Get houses"
> > >>>> > - "For each house get the user that lives in the house
> > >>>> > - "And for each user get their list of todos"
> > >>>> >
> > >>>> > The result has to come back such that it's a single object for
> each
> > >>>> row
> > >>>> > ==============
> > >>>> > {
> > >>>> >  houses: [{
> > >>>> >    address: "123 Main Street",
> > >>>> >    users: [{
> > >>>> >      name: "Joe",
> > >>>> >      todos: [{
> > >>>> >        description: "Take out trash"
> > >>>> >      }]
> > >>>> >    }]
> > >>>> > }
> > >>>> >
> > >>>> > From a SQL perspective, the logical equivalent would be something
> > >>>> like:
> > >>>> > ==============
> > >>>> > SELECT
> > >>>> >    house.address,
> > >>>> >    (somehow nest users + double-nest todos under user)
> > >>>> > FROM
> > >>>> >    house
> > >>>> > JOIN
> > >>>> >    user ON user.house_id = house.id
> > >>>> >    todos ON todos.user_id = user.id
> > >>>> > WHERE
> > >>>> >    house.id = 1
> > >>>> >
> > >>>> > I'm not familiar enough with SQL to have figured out a way to make
> > >>>> this
> > >>>> > kind of
> > >>>> > query using operators that are supported across most of the DB's
> > >>>> Calcite has
> > >>>> > adapters for.
> > >>>> >
> > >>>> > Currently what I have done instead, on a tip from Gopalakrishna
> > Holla
> > >>>> from
> > >>>> > LinkedIn Coral team who has built GraphQL-on-Calcite, was to break
> > up
> > >>>> the
> > >>>> > query
> > >>>> > into individual parts and then do the join in-memory:
> > >>>> >
> > >>>> > SELECT ... FROM users;
> > >>>> > SELECT ... FROM todos WHERE todos.user_id IN (ResultSet from prev
> > >>>> response);
> > >>>> >
> > >>>> > However, the way I am doing this seems like it's probably very
> > >>>> inefficient.
> > >>>> > Because I do a series of nested loops to add the key to each
> object
> > >>>> in the
> > >>>> > parent ResultSet row:
> > >>>> >
> > >>>> >
> > >>>>
> >
> https://github.com/GavinRay97/GraphQLCalcite/blob/648e0ac4f6810a3c360d13a03e6597c33406de4b/src/main/kotlin/TableDataFetcher.kt#L135-L153
> > >>>> >
> > >>>> > Is there some better way of doing this?
> > >>>> > I would be eternally grateful for any advice.
> > >>>> >
> > >>>> >
> > >>>> > On Thu, Feb 10, 2022 at 3:27 PM Julian Hyde <
> [email protected]
> > >
> > >>>> wrote:
> > >>>> >
> > >>>> >> Yes, if you want to do multiple layers of aggregation, use CTEs
> > >>>> (WITH) or
> > >>>> >> nested sub-queries. For example, the following is I believe valid
> > >>>> standard
> > >>>> >> SQL, and actually computes something useful:
> > >>>> >>
> > >>>> >>  WITH q1 AS
> > >>>> >>   (SELECT deptno, job, AVG(sal) AS avg_sal
> > >>>> >>    FROM emp
> > >>>> >>    GROUP BY deptno, job)
> > >>>> >>  WITH q2 AS
> > >>>> >>   (SELECT deptno, AVG(avg_sal) AS avg_avg_sal
> > >>>> >>    FROM q1
> > >>>> >>    GROUP BY deptno)
> > >>>> >>  SELECT AVG(avg_avg_sal)
> > >>>> >>  FROM q2
> > >>>> >>  GROUP BY ()
> > >>>> >>
> > >>>> >> (You can omit the “GROUP BY ()” line, but I think it makes things
> > >>>> clearer.)
> > >>>> >>
> > >>>> >> Julian
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>> On Feb 10, 2022, at 12:17 PM, Justin Swanhart <
> > [email protected]>
> > >>>> >> wrote:
> > >>>> >>>
> > >>>> >>> I wish you could unsend emails :)  Answering my own question,
> no,
> > >>>> because
> > >>>> >>> that would return three rows with the average :D
> > >>>> >>>
> > >>>> >>> On Thu, Feb 10, 2022 at 3:16 PM Justin Swanhart <
> > >>>> [email protected]>
> > >>>> >> wrote:
> > >>>> >>>
> > >>>> >>>> Just out of curiosity, is the second level aggregation using
> AVG
> > >>>> in a
> > >>>> >>>> window context?  It the frame is the whole table and it
> > aggregates
> > >>>> over
> > >>>> >> it?
> > >>>> >>>>
> > >>>> >>>> On Thu, Feb 10, 2022 at 3:12 PM Justin Swanhart <
> > >>>> [email protected]>
> > >>>> >>>> wrote:
> > >>>> >>>>
> > >>>> >>>>> That is really neat about Oracle.
> > >>>> >>>>>
> > >>>> >>>>> The alternative in general is to use a subquery:
> > >>>> >>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
> > >>>> >>>>> becomes
> > >>>> >>>>> select avg(the_avg)
> > >>>> >>>>> from (select avg(sal) from emp group b deptno) an_alias;
> > >>>> >>>>>
> > >>>> >>>>> or
> > >>>> >>>>>
> > >>>> >>>>> with the_cte as (select avg(sal) x from emp group by deptno)
> > >>>> >>>>> select avg(x) from the_cte;
> > >>>> >>>>>
> > >>>> >>>>> On Thu, Feb 10, 2022 at 3:03 PM Julian Hyde <
> > >>>> [email protected]>
> > >>>> >>>>> wrote:
> > >>>> >>>>>
> > >>>> >>>>>> Some databases, e.g. Oracle, allow TWO levels of nesting:
> > >>>> >>>>>>
> > >>>> >>>>>> SELECT avg(sal) FROM emp GROUP BY deptno;
> > >>>> >>>>>>
> > >>>> >>>>>> AVG(SAL)
> > >>>> >>>>>> ========
> > >>>> >>>>>> 1,566.67
> > >>>> >>>>>> 2,175.00
> > >>>> >>>>>> 2,916.65
> > >>>> >>>>>>
> > >>>> >>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
> > >>>> >>>>>>
> > >>>> >>>>>> AVG(SUM(SAL))
> > >>>> >>>>>> =============
> > >>>> >>>>>>       9,675
> > >>>> >>>>>>
> > >>>> >>>>>> The first level aggregates by department (returning 3
> records),
> > >>>> and
> > >>>> >> the
> > >>>> >>>>>> second level computes the grand total (returning 1 record).
> But
> > >>>> that
> > >>>> >> is an
> > >>>> >>>>>> exceptional case.
> > >>>> >>>>>>
> > >>>> >>>>>> Generally, any expression in the SELECT or HAVING clause of
> an
> > >>>> >> aggregate
> > >>>> >>>>>> query is either ‘before’ or ‘after’ aggregation. Consider
> > >>>> >>>>>>
> > >>>> >>>>>> SELECT t.x + 1 AS a, 2 + SUM(t.y + 3) AS b
> > >>>> >>>>>> FROM t
> > >>>> >>>>>> GROUP BY t.x
> > >>>> >>>>>>
> > >>>> >>>>>> The expressions “t.y” and “t.y + 3” occur before aggregation;
> > >>>> “t.x”,
> > >>>> >>>>>> “t.x + 1”, “SUM(t.y + 3)” and “2 + SUM(t.y + 3)” occur after
> > >>>> >> aggregation.
> > >>>> >>>>>> SQL semantics rely heavily on this stratification. Allowing
> an
> > >>>> extra
> > >>>> >> level
> > >>>> >>>>>> of aggregation would mess it all up.
> > >>>> >>>>>>
> > >>>> >>>>>> Julian
> > >>>> >>>>>>
> > >>>> >>>>>>
> > >>>> >>>>>>
> > >>>> >>>>>>> On Feb 10, 2022, at 9:45 AM, Justin Swanhart <
> > >>>> [email protected]>
> > >>>> >>>>>> wrote:
> > >>>> >>>>>>>
> > >>>> >>>>>>> This is a SQL limitation.
> > >>>> >>>>>>>
> > >>>> >>>>>>> mysql> select sum(1);
> > >>>> >>>>>>> +--------+
> > >>>> >>>>>>> | sum(1) |
> > >>>> >>>>>>> +--------+
> > >>>> >>>>>>> |      1 |
> > >>>> >>>>>>> +--------+
> > >>>> >>>>>>> 1 row in set (0.00 sec)
> > >>>> >>>>>>>
> > >>>> >>>>>>> mysql> select sum(sum(1));
> > >>>> >>>>>>> ERROR 1111 (HY000): Invalid use of group function
> > >>>> >>>>>>>
> > >>>> >>>>>>> On Thu, Feb 10, 2022 at 12:39 PM Gavin Ray <
> > >>>> [email protected]>
> > >>>> >>>>>> wrote:
> > >>>> >>>>>>>
> > >>>> >>>>>>>> Went to test this query out and found that it can't be
> > >>>> performed:
> > >>>> >>>>>>>>
> > >>>> >>>>>>>> SELECT
> > >>>> >>>>>>>>  JSON_OBJECT(
> > >>>> >>>>>>>>      KEY 'users'
> > >>>> >>>>>>>>      VALUE JSON_ARRAYAGG(
> > >>>> >>>>>>>>          JSON_OBJECT(
> > >>>> >>>>>>>>              KEY 'name' VALUE "users"."name",
> > >>>> >>>>>>>>              KEY 'todos' VALUE JSON_ARRAYAGG(
> > >>>> >>>>>>>>                  JSON_OBJECT(
> > >>>> >>>>>>>>                      KEY 'description' VALUE
> > >>>> "todos"."description"
> > >>>> >>>>>>>>                  )
> > >>>> >>>>>>>>              )
> > >>>> >>>>>>>>          )
> > >>>> >>>>>>>>      )
> > >>>> >>>>>>>>  )
> > >>>> >>>>>>>> FROM
> > >>>> >>>>>>>>  "users"
> > >>>> >>>>>>>> LEFT OUTER JOIN
> > >>>> >>>>>>>>  "todos" ON "users"."id" = "todos"."user_id";
> > >>>> >>>>>>>>
> > >>>> >>>>>>>> Checking the source, seems this is a blanket policy, not a
> > >>>> >>>>>>>> datasource-specific thing.
> > >>>> >>>>>>>> From a functional perspective, it doesn't feel like it's
> much
> > >>>> >>>>>> different
> > >>>> >>>>>>>> from JOINs
> > >>>> >>>>>>>> But I don't understand relational theory or DB
> functionality
> > >>>> in the
> > >>>> >>>>>> least,
> > >>>> >>>>>>>> so I'm not fit to judge.
> > >>>> >>>>>>>>
> > >>>> >>>>>>>> Just curious why Calcite doesn't allow this
> > >>>> >>>>>>>>
> > >>>> >>>>>>
> > >>>> >>>>>>
> > >>>> >>
> > >>>> >>
> > >>>>
> > >>>>
> >
>

Re: Why are nested aggregations illegal? Best alternatives?

Reply via email to