[jira] [Created] (CASSANDRA-17811) CQL aggregation functions on collections

Jira Thu, 11 Aug 2022 09:38:04 -0700

Andres de la Peña created CASSANDRA-17811:
---------------------------------------------


             Summary: CQL aggregation functions on collections
                 Key: CASSANDRA-17811
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17811
             Project: Cassandra
          Issue Type: Bug
          Components: CQL/Semantics
            Reporter: Andres de la Peña


It has been found during CASSANDRA-8877 that CQL's aggregation functions 
{{max}}, {{min}} and {{count}} can be applied to collections, but the 
result is returned as a blob. For example:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, l list<int>);
INSERT INTO t(k, l) VALUES (0, [1, 2, 3]);
INSERT INTO t(k, l) VALUES (1, [10, 20, 30]);
SELECT max(l) FROM t;

 system.max(l)
------------------------------------------------------------
 0x00000003000000040000000a0000000400000014000000040000001e
{code}
I'm not sure whether the function shouldn't be supported for collections at 
all, or whether it should be supported and the result is wrong.

In the example above, the returned blob is the serialized value of 
{{[10, 20, 30]}}, which is the right one according to the list comparator. I 
think this happens because the matched version of the function is the one for 
{{(blob) -> blob}}. We would need a {{(list<int>) -> list<int>}} function 
instead, but this function doesn't exist.
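
To make the blob above easier to read, here is a minimal Python sketch that decodes it. The layout it assumes (a 4-byte big-endian element count, then a 4-byte length followed by the value bytes for each element) is inferred from the bytes in the example, and {{decode_int_list}} is a hypothetical helper, not a Cassandra API.

```python
import struct

def decode_int_list(blob):
    """Decode a serialized list<int>, assuming the layout described above."""
    # First 4 bytes: number of elements (big-endian signed int).
    (count,) = struct.unpack_from(">i", blob, 0)
    offset = 4
    values = []
    for _ in range(count):
        # Each element is prefixed by its length in bytes.
        (size,) = struct.unpack_from(">i", blob, offset)
        offset += 4
        if size != 4:
            raise ValueError("expected a 4-byte int element, got %d bytes" % size)
        (value,) = struct.unpack_from(">i", blob, offset)
        offset += size
        values.append(value)
    return values

# The blob returned by SELECT max(l) FROM t, without the 0x prefix.
blob = bytes.fromhex("00000003000000040000000a0000000400000014000000040000001e")
print(decode_int_list(blob))
```

Under that assumption, the blob decodes to the three elements 10, 20 and 30, that is, the whole list of the row with k = 1.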

It would be quite easy to add versions of the {{max}}, {{min}} and 
{{count}} functions for every type of collection ({{list<int>}}, 
{{list<text>}}, {{map<int, int>}}, {{map<int, text>}}, etc.). The downside of 
this approach is that it would increase the number of aggregation functions 
kept in memory from 82 to 2722, if my maths are right. This is quite an 
increase, mainly due to the many possible combinations of the {{map}} type. 
[Here|https://github.com/adelapena/cassandra/commit/e3ba3c2dc36ce58d06942078c708ffb93eb3cd84]
 is a quick, incomplete prototype of the approach.
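
As a rough illustration of why the {{map}} type dominates that growth, the following Python sketch counts one overload per function and per collection type. It is a simplification under stated assumptions: it ignores nested and frozen collections, tuples and UDTs, the parameter values are hypothetical, and it does not try to reproduce the 82 or 2722 figures above.

```python
def collection_overloads(element_types, functions):
    """Count overloads needed for non-nested list, set and map types.

    element_types: number of element types considered (hypothetical input).
    functions: number of aggregation functions to overload (e.g. max, min, count).
    """
    lists = element_types                   # list<T>
    sets = element_types                    # set<T>
    maps = element_types * element_types    # map<K, V>: one per (K, V) pair
    return functions * (lists + sets + maps)

# Hypothetical example: 20 element types and 3 functions.
print(collection_overloads(20, 3))
```

With these hypothetical inputs, the maps account for 1200 of the 1320 overloads, so the number grows quadratically with the number of element types.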

Also, I'm not sure that applying those aggregation functions to collections is 
very useful in practice. Thus, an alternative approach would be just forbidding 
them, considering them not supported. I don't think it would be a problem for 
backward compatibility since no one has complained about the current behaviour, 
and we might well consider that the original intent was not to allow 
aggregation on collections. At least, there aren't any tests for it, and I 
can't find any documentation about it either.

Another idea that comes to mind is that we could change the meaning of those 
functions to aggregate the values within the collection, instead of aggregating 
the rows. In that case, the behaviour would be:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, l list<int>);
INSERT INTO t(k, l) VALUES (0, [1, 2, 3]);
INSERT INTO t(k, l) VALUES (1, [10, 20, 30]);
SELECT max(l) FROM t;

 k | system.max(l)
---+---------------
 1 | 30
 0 | 3
{code}
Of course we could have separate function names for that kind of collection 
aggregation, like {{collectionMax}}, {{maxItem}}, or something like that.
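
The difference between the two meanings can be sketched in Python using the same data as the examples above. This is only an illustration: Python's built-in list comparison (element by element, then by length) stands in for the list comparator, which is an assumption, and both function names are hypothetical.

```python
# Rows of table t, keyed by k, with the list column l as the value.
rows = {0: [1, 2, 3], 1: [10, 20, 30]}

def max_across_rows(rows):
    """Current meaning: one result for all rows, the greatest list."""
    # Python compares lists element by element, standing in for the list comparator.
    return max(rows.values())

def max_within_each_row(rows):
    """Proposed meaning: one result per row, the greatest item of its list."""
    return {k: max(values) for k, values in rows.items()}

print(max_across_rows(rows))
print(max_within_each_row(rows))
```

The first function returns a single list for the whole table, while the second returns one scalar per row, which is why a separate function name might be less surprising than changing what {{max}} means.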



--
This message was sent by Atlassian Jira
(v8.20.10#820010)