Hi,

I have two questions/observations/suggestions related to the SQLite function 
"group_concat". 
This function is extremely useful, since it elegantly and concisely replaces 
otherwise quite tricky procedures. However, it lacks two features, and I would 
like to discuss work-arounds for them.

Let's consider table P(PiD,Age,City), giving, for a set of persons, their age 
and city. PiD is the PK.

create table P(PiD integer not null primary key,Age integer,City text);
insert into P values (1,35,'B'),(2,32,'B'),(3,35,'A'),(4,35,'B'),(5,32,'A');

+-----+-----+------+
| PiD | Age | City |
+-----+-----+------+
| 1   | 35  | B    |
| 2   | 32  | B    |
| 3   | 35  | A    |
| 4   | 35  | B    |
| 5   | 32  | A    |
+-----+-----+------+


1. The case of the DISTINCT clause in "group_concat" function

I want to extract a table that assigns to each age the list of cities in which 
some persons of this age live:

select Age,group_concat(City) as Cities
from P
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B,A    |
| 35  | B,A,B  |
+-----+--------+

Duplicates can be removed with the "distinct" quantifier:

select Age,group_concat(distinct City) as Cities
from P
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B,A    |
| 35  | B,A    |
+-----+--------+

Now, I would like to change the default separator (",") to ";". No problem 
as long as duplicates are kept:

select Age,group_concat(City,';') as Cities
from P
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B;A    |
| 35  | B;A;B  |
+-----+--------+

But this does not work when I ask to remove duplicates:

select Age,group_concat(distinct City,';') as Cities
from P
group by Age;

We get the error message: "DISTINCT aggregates must have exactly one argument". 
Clearly, the analyzer doesn't consider the second argument as the separator but 
as an additional element to concatenate, which is invalid (the documentation 
says "In any aggregate function that takes a single argument, that argument can 
be preceded by the keyword DISTINCT").

This can be solved, but at the cost of increased complexity (and perhaps 
execution time), by removing duplicates in a subquery in the "from" clause ...:

select Age,group_concat(City,';') as Cities 
from (select distinct Age,City
      from P)
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B;A    |
| 35  | B;A    |
+-----+--------+

... or in a subquery in the select-list:

select Age,(select group_concat(City,';')
            from (select distinct Age,City
                  from P)
            where Age = AP.Age) as Cities
from P AP 
group by Age;

... or even by replacing the default ',' with ';' in the resulting concatenated 
list, provided no comma appears in the concatenated elements:

select Age,replace(group_concat(distinct City),',',';') as Cities
from P
group by Age;

A more drastic solution would be to code a new, user-defined, aggregate 
function that simulates a "group_concat(distinct ...,separator)" function.
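As an illustration, here is a minimal sketch of such an aggregate, written with
Python's standard sqlite3 module (any host language that can register SQLite
functions would do). The function name "group_concat_distinct" and the database
file "test.db" are only placeholders, and the aggregated values are assumed to
be text:

import sqlite3

class GroupConcatDistinct:
    # Accumulate distinct values in first-occurrence order.
    def __init__(self):
        self.seen = []
        self.sep = ','

    def step(self, value, sep):
        # The second argument plays the role of the separator.
        if sep is not None:
            self.sep = sep
        if value is not None and value not in self.seen:
            self.seen.append(value)

    def finalize(self):
        return self.sep.join(self.seen) if self.seen else None

con = sqlite3.connect('test.db')   # placeholder database containing table P
con.create_aggregate('group_concat_distinct', 2, GroupConcatDistinct)
for row in con.execute(
        "select Age,group_concat_distinct(City,';') as Cities from P group by Age"):
    print(row)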


My question/suggestion
----------------------------------
Is there a better way to remove duplicates with the current version of SQLite? 
Could we imagine, in a future version of SQLite, an extension of "group_concat" 
that accepts both "distinct" and a user-defined separator? If so, I will 
support it!


2. The case of the ORDER BY clause in "group_concat" function

My second concern is about the way to sort the items of the "group_concat" 
result, as is possible in some other DBMS. E.g., how to get such sorted lists 
in SQLite:

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | A,B    |
| 35  | A,B,B  |
+-----+--------+
or
+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | A,B    |
| 35  | A,B    |
+-----+--------+

The usual technique (generally found on the web) to sort the lists is a 
subquery in the "from" clause in which table P (or a projection of it) is 
pre-sorted on the elements to aggregate:

select Age,group_concat(City) as Cities 
from (select Age,City 
      from P
      order by City)
-- or (more constraining for the optimizer): order by Age,City)
group by Age;

This works fine in all the tests I have carried out. 
However, this technique relies on the assumption that the order is preserved 
and exploited when the outer query is executed. 
Though this hypothesis seems quite natural, it theoretically depends on 
multiple factors: the size of the source data, indexes, other clauses of the 
source query that may require a different row ordering, and the current 
strategies of the optimizer (which may evolve in future versions). In short, 
unless the hypothesis formulated above can be taken as certain and definitive 
in the long term, this technique may feel unstable and unsafe.

My question
-----------------
Can we be sure that this form will always, in all cases, produce the desired 
element ordering in the concatenated list?

Thanks for your comments.

Jean-Luc Hainaut
