Re: Why are nested aggregations illegal? Best alternatives?

Gavin Ray Wed, 23 Feb 2022 09:22:36 -0800

No dice still unfortunately =/

If it's any easier, I put a self-contained single class reproduction on
Github:
https://github.com/GavinRay97/calcite-nested-json-subquery-bug/blob/main/src/test/java/com/example/AppTest.java


Is it worth filing a JIRA ticket over you think?


On Wed, Feb 23, 2022 at 3:45 AM Julian Hyde <[email protected]> wrote:

> Try ‘value ((‘ in place of ‘value (‘.
>
> Julian
>
> > On Feb 21, 2022, at 9:33 AM, Gavin Ray <[email protected]> wrote:
> >
> > I hadn't thought about the fact that ORM's probably have to solve this
> > problem as well
> > That is a great suggestion, I will try to investigate some of the popular
> > ORM codebases and see if there are any tricks they are using.
> >
> > I seem to maybe be getting a tiny bit closer by using subqueries like
> > Julian suggested instead of operator calls
> > But if I may ask what is probably a very stupid question:
> >
> > What might the error message
> > "parse failed: Query expression encountered in illegal context
> > (state=,code=0)"
> >
> > Mean in the below query?
> >
> > The reason why I am confused is because the query runs if I remove the
> > innermost subquery ("todos")
> > But the innermost subquery is a direct copy-paste of the subquery above
> it,
> > so I know it MUST be valid
> >
> > As usual, thank you so much for your help/guidance Stamatis.
> >
> > select
> >  "g0"."id" "id",
> >  "g0"."address" "address",
> >  (
> >    select json_arrayagg(json_object(
> >      key 'id' value "g1"."id",
> >      key 'todos' value (
> >        select json_arrayagg(json_object(
> >          key 'id' value "g2"."id",
> >          key 'description' value "g2"."description",
> >        ))
> >        from (
> >          select * from "todos"
> >          where "g1"."id" = "user_id"
> >          order by "id"
> >        ) "g2"
> >      )
> >    ))
> >    from (
> >      select * from "users"
> >      where "g0"."id" = "house_id"
> >      order by "id"
> >    ) "g1"
> >  ) "users"
> > from "houses" "g0"
> > order by "g0"."id"
> >
> >> On Mon, Feb 21, 2022 at 8:07 AM Stamatis Zampetakis <[email protected]>
> >> wrote:
> >>
> >> Hi Gavin,
> >>
> >> A few more comments in case they help to get you a bit further on your
> >> work.
> >>
> >> The need to return the result as a single object is a common problem in
> >> object relational mapping (ORM) frameworks/APIS (JPA, Datanucleus,
> >> Hibernate, etc.). Apart from the suggestions so far maybe you could look
> >> into these frameworks as well for more inspiration.
> >>
> >> Moreover your approach of decomposing the query into individual parts is
> >> commonly known as the N+1 problem [1].
> >>
> >> Lastly, keep in mind that you can introduce custom UDF, UDAF functions
> if
> >> you need more flexibility on reconstructing the final result.
> >>
> >> Best,
> >> Stamatis
> >>
> >> [1]
> >>
> >>
> https://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem-in-orm-object-relational-mapping
> >>
> >>> On Sun, Feb 13, 2022 at 3:59 AM Gavin Ray <[email protected]>
> wrote:
> >>>
> >>> Ah wait nevermind, got excited and spoke too soon. Looking at it more
> >>> closely, that data isn't correct.
> >>> At least it's in somewhat the right shape, ha!
> >>>
> >>>> On Sat, Feb 12, 2022 at 9:57 PM Gavin Ray <[email protected]>
> wrote:
> >>>
> >>>> After ~5 hours, I think I may have made some progress =)
> >>>>
> >>>> I have this, which currently works. The problem is that the nested
> >>> columns
> >>>> don't have names on them.
> >>>> Since I need to return a nested "Map<String, Object>", I have to
> figure
> >>>> out how to convert this query into a form that gives column names.
> >>>>
> >>>> But this is still great progress I think!
> >>>>
> >>>> SELECT
> >>>>    "todos".*,
> >>>>    ARRAY(
> >>>>        SELECT
> >>>>            "users".*,
> >>>>            ARRAY(
> >>>>                SELECT
> >>>>                    "todos".*
> >>>>                FROM
> >>>>                    "todos"
> >>>>            ) AS "todos"
> >>>>        FROM
> >>>>            "users"
> >>>>    ) AS "users"
> >>>> FROM
> >>>>    "todos"
> >>>> WHERE
> >>>>    "user_id" IN (
> >>>>        SELECT
> >>>>            "user_id"
> >>>>        FROM
> >>>>            "users"
> >>>>        WHERE
> >>>>            "house_id" IN (
> >>>>                SELECT
> >>>>                    "id"
> >>>>                FROM
> >>>>                    "houses"
> >>>>            )
> >>>>    );
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> +----+---------+------------------------+------------------------------------------------------------------------------+
> >>>> | id | user_id |      description       |
> >>>>                                            |
> >>>>
> >>>>
> >>>
> >>
> +----+---------+------------------------+------------------------------------------------------------------------------+
> >>>> | 1  | 1       | Take out the trash     | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 2  | 1       | Watch my favorite show | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 3  | 1       | Charge my phone        | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 4  | 2       | Cook dinner            | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 5  | 2       | Read a book            | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 6  | 2       | Organize office        | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 7  | 3       | Walk the dog           | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>> | 8  | 3       | Feed the cat           | [{1, John, 1, [{1, 1, Take
> >> out
> >>>> the trash}, {2, 1, Watch my favorite show}, { |
> >>>>
> >>>>
> >>>
> >>
> +----+---------+------------------------+------------------------------------------------------------------------------+
> >>>>
> >>>> On Sat, Feb 12, 2022 at 4:13 PM Gavin Ray <[email protected]>
> >> wrote:
> >>>>
> >>>>> Nevermind, this is a standard term not something Calcite-specific it
> >>>>> seems!
> >>>>>
> >>>>> https://en.wikipedia.org/wiki/Correlated_subquery
> >>>>>
> >>>>> On Sat, Feb 12, 2022 at 3:46 PM Gavin Ray <[email protected]>
> >>> wrote:
> >>>>>
> >>>>>> Forgive my ignorance/lack of experience
> >>>>>>
> >>>>>> I am somewhat familiar with the ARRAY() function, but not sure I
> know
> >>>>>> the term "correlated"
> >>>>>> Searching the Calcite codebase for uses of "correlated" + "query", I
> >>>>>> found:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>
> >>
> https://github.com/apache/calcite/blob/1d4f1b394bfdba03c5538017e12ab2431b137ca9/core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java#L1603-L1612
> >>>>>>
> >>>>>>  @Test void testCorrelatedSubQueryInJoin() {
> >>>>>>    final String sql = "select *\n"
> >>>>>>        + "from emp as e\n"
> >>>>>>        + "join dept as d using (deptno)\n"
> >>>>>>        + "where d.name = (\n"
> >>>>>>        + "  select max(name)\n"
> >>>>>>        + "  from dept as d2\n"
> >>>>>>        + "  where d2.deptno = d.deptno)";
> >>>>>>    sql(sql).withExpand(false).ok();
> >>>>>>  }
> >>>>>>
> >>>>>> But I also see this, which says it is "uncorrelated" but seems very
> >>>>>> similar?
> >>>>>>
> >>>>>>  @Test void testInUncorrelatedSubQuery() {
> >>>>>>    final String sql = "select empno from emp where deptno in"
> >>>>>>        + " (select deptno from dept)";
> >>>>>>    sql(sql).ok();
> >>>>>>  }
> >>>>>>
> >>>>>> I wouldn't blame you for not answering such a basic question -- but
> >>> what
> >>>>>> exactly does "correlation" mean here?
> >>>>>>
> >>>>>> Thanks, as usual Julian
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Feb 12, 2022 at 3:08 PM Julian Hyde <[email protected]
> >
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Correlated ARRAY sub-query?
> >>>>>>>
> >>>>>>>> On Feb 12, 2022, at 10:40 AM, Gavin Ray <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Apologies for the delay in replying
> >>>>>>>>
> >>>>>>>> This makes things clear and seems obvious now that you point it
> >> out.
> >>>>>>>> Thank you, Justin and Julian =)
> >>>>>>>>
> >>>>>>>> Let me ask another question (if I may) that I am struggling to
> >>> phrase
> >>>>>>>> easily.
> >>>>>>>>
> >>>>>>>> So with GraphQL, you might have a query like:
> >>>>>>>> - "Get houses"
> >>>>>>>> - "For each house get the user that lives in the house
> >>>>>>>> - "And for each user get their list of todos"
> >>>>>>>>
> >>>>>>>> The result has to come back such that it's a single object for
> >> each
> >>>>>>> row
> >>>>>>>> ==============
> >>>>>>>> {
> >>>>>>>> houses: [{
> >>>>>>>>   address: "123 Main Street",
> >>>>>>>>   users: [{
> >>>>>>>>     name: "Joe",
> >>>>>>>>     todos: [{
> >>>>>>>>       description: "Take out trash"
> >>>>>>>>     }]
> >>>>>>>>   }]
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> From a SQL perspective, the logical equivalent would be something
> >>>>>>> like:
> >>>>>>>> ==============
> >>>>>>>> SELECT
> >>>>>>>>   house.address,
> >>>>>>>>   (somehow nest users + double-nest todos under user)
> >>>>>>>> FROM
> >>>>>>>>   house
> >>>>>>>> JOIN
> >>>>>>>>   user ON user.house_id = house.id
> >>>>>>>>   todos ON todos.user_id = user.id
> >>>>>>>> WHERE
> >>>>>>>>   house.id = 1
> >>>>>>>>
> >>>>>>>> I'm not familiar enough with SQL to have figured out a way to make
> >>>>>>> this
> >>>>>>>> kind of
> >>>>>>>> query using operators that are supported across most of the DB's
> >>>>>>> Calcite has
> >>>>>>>> adapters for.
> >>>>>>>>
> >>>>>>>> Currently what I have done instead, on a tip from Gopalakrishna
> >>> Holla
> >>>>>>> from
> >>>>>>>> LinkedIn Coral team who has built GraphQL-on-Calcite, was to break
> >>> up
> >>>>>>> the
> >>>>>>>> query
> >>>>>>>> into individual parts and then do the join in-memory:
> >>>>>>>>
> >>>>>>>> SELECT ... FROM users;
> >>>>>>>> SELECT ... FROM todos WHERE todos.user_id IN (ResultSet from prev
> >>>>>>> response);
> >>>>>>>>
> >>>>>>>> However, the way I am doing this seems like it's probably very
> >>>>>>> inefficient.
> >>>>>>>> Because I do a series of nested loops to add the key to each
> >> object
> >>>>>>> in the
> >>>>>>>> parent ResultSet row:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>
> >>
> https://github.com/GavinRay97/GraphQLCalcite/blob/648e0ac4f6810a3c360d13a03e6597c33406de4b/src/main/kotlin/TableDataFetcher.kt#L135-L153
> >>>>>>>>
> >>>>>>>> Is there some better way of doing this?
> >>>>>>>> I would be eternally grateful for any advice.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Feb 10, 2022 at 3:27 PM Julian Hyde <
> >> [email protected]
> >>>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Yes, if you want to do multiple layers of aggregation, use CTEs
> >>>>>>> (WITH) or
> >>>>>>>>> nested sub-queries. For example, the following is I believe valid
> >>>>>>> standard
> >>>>>>>>> SQL, and actually computes something useful:
> >>>>>>>>>
> >>>>>>>>> WITH q1 AS
> >>>>>>>>>  (SELECT deptno, job, AVG(sal) AS avg_sal
> >>>>>>>>>   FROM emp
> >>>>>>>>>   GROUP BY deptno, job)
> >>>>>>>>> WITH q2 AS
> >>>>>>>>>  (SELECT deptno, AVG(avg_sal) AS avg_avg_sal
> >>>>>>>>>   FROM q1
> >>>>>>>>>   GROUP BY deptno)
> >>>>>>>>> SELECT AVG(avg_avg_sal)
> >>>>>>>>> FROM q2
> >>>>>>>>> GROUP BY ()
> >>>>>>>>>
> >>>>>>>>> (You can omit the “GROUP BY ()” line, but I think it makes things
> >>>>>>> clearer.)
> >>>>>>>>>
> >>>>>>>>> Julian
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Feb 10, 2022, at 12:17 PM, Justin Swanhart <
> >>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I wish you could unsend emails :)  Answering my own question,
> >> no,
> >>>>>>> because
> >>>>>>>>>> that would return three rows with the average :D
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 10, 2022 at 3:16 PM Justin Swanhart <
> >>>>>>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Just out of curiosity, is the second level aggregation using
> >> AVG
> >>>>>>> in a
> >>>>>>>>>>> window context?  It the frame is the whole table and it
> >>> aggregates
> >>>>>>> over
> >>>>>>>>> it?
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Feb 10, 2022 at 3:12 PM Justin Swanhart <
> >>>>>>> [email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> That is really neat about Oracle.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The alternative in general is to use a subquery:
> >>>>>>>>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
> >>>>>>>>>>>> becomes
> >>>>>>>>>>>> select avg(the_avg)
> >>>>>>>>>>>> from (select avg(sal) from emp group b deptno) an_alias;
> >>>>>>>>>>>>
> >>>>>>>>>>>> or
> >>>>>>>>>>>>
> >>>>>>>>>>>> with the_cte as (select avg(sal) x from emp group by deptno)
> >>>>>>>>>>>> select avg(x) from the_cte;
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Feb 10, 2022 at 3:03 PM Julian Hyde <
> >>>>>>> [email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Some databases, e.g. Oracle, allow TWO levels of nesting:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT avg(sal) FROM emp GROUP BY deptno;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> AVG(SAL)
> >>>>>>>>>>>>> ========
> >>>>>>>>>>>>> 1,566.67
> >>>>>>>>>>>>> 2,175.00
> >>>>>>>>>>>>> 2,916.65
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> AVG(SUM(SAL))
> >>>>>>>>>>>>> =============
> >>>>>>>>>>>>>      9,675
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The first level aggregates by department (returning 3
> >> records),
> >>>>>>> and
> >>>>>>>>> the
> >>>>>>>>>>>>> second level computes the grand total (returning 1 record).
> >> But
> >>>>>>> that
> >>>>>>>>> is an
> >>>>>>>>>>>>> exceptional case.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Generally, any expression in the SELECT or HAVING clause of
> >> an
> >>>>>>>>> aggregate
> >>>>>>>>>>>>> query is either ‘before’ or ‘after’ aggregation. Consider
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT t.x + 1 AS a, 2 + SUM(t.y + 3) AS b
> >>>>>>>>>>>>> FROM t
> >>>>>>>>>>>>> GROUP BY t.x
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The expressions “t.y” and “t.y + 3” occur before aggregation;
> >>>>>>> “t.x”,
> >>>>>>>>>>>>> “t.x + 1”, “SUM(t.y + 3)” and “2 + SUM(t.y + 3)” occur after
> >>>>>>>>> aggregation.
> >>>>>>>>>>>>> SQL semantics rely heavily on this stratification. Allowing
> >> an
> >>>>>>> extra
> >>>>>>>>> level
> >>>>>>>>>>>>> of aggregation would mess it all up.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Julian
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Feb 10, 2022, at 9:45 AM, Justin Swanhart <
> >>>>>>> [email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is a SQL limitation.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> mysql> select sum(1);
> >>>>>>>>>>>>>> +--------+
> >>>>>>>>>>>>>> | sum(1) |
> >>>>>>>>>>>>>> +--------+
> >>>>>>>>>>>>>> |      1 |
> >>>>>>>>>>>>>> +--------+
> >>>>>>>>>>>>>> 1 row in set (0.00 sec)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> mysql> select sum(sum(1));
> >>>>>>>>>>>>>> ERROR 1111 (HY000): Invalid use of group function
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Feb 10, 2022 at 12:39 PM Gavin Ray <
> >>>>>>> [email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Went to test this query out and found that it can't be
> >>>>>>> performed:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> SELECT
> >>>>>>>>>>>>>>> JSON_OBJECT(
> >>>>>>>>>>>>>>>     KEY 'users'
> >>>>>>>>>>>>>>>     VALUE JSON_ARRAYAGG(
> >>>>>>>>>>>>>>>         JSON_OBJECT(
> >>>>>>>>>>>>>>>             KEY 'name' VALUE "users"."name",
> >>>>>>>>>>>>>>>             KEY 'todos' VALUE JSON_ARRAYAGG(
> >>>>>>>>>>>>>>>                 JSON_OBJECT(
> >>>>>>>>>>>>>>>                     KEY 'description' VALUE
> >>>>>>> "todos"."description"
> >>>>>>>>>>>>>>>                 )
> >>>>>>>>>>>>>>>             )
> >>>>>>>>>>>>>>>         )
> >>>>>>>>>>>>>>>     )
> >>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>> FROM
> >>>>>>>>>>>>>>> "users"
> >>>>>>>>>>>>>>> LEFT OUTER JOIN
> >>>>>>>>>>>>>>> "todos" ON "users"."id" = "todos"."user_id";
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Checking the source, seems this is a blanket policy, not a
> >>>>>>>>>>>>>>> datasource-specific thing.
> >>>>>>>>>>>>>>> From a functional perspective, it doesn't feel like it's
> >> much
> >>>>>>>>>>>>> different
> >>>>>>>>>>>>>>> from JOINs
> >>>>>>>>>>>>>>> But I don't understand relational theory or DB
> >> functionality
> >>>>>>> in the
> >>>>>>>>>>>>> least,
> >>>>>>>>>>>>>>> so I'm not fit to judge.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Just curious why Calcite doesn't allow this
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>
> >>
>

Re: Why are nested aggregations illegal? Best alternatives?

Reply via email to