Re: [sqlite] Generated columns and COLLATE in the AS parens

Markus Winand Thu, 30 Jan 2020 12:44:18 -0800


> On 30 Jan 2020, at 18:20, Richard Hipp <d...@sqlite.org> wrote:
> 
> On 1/30/20, Markus Winand <markus.win...@winand.at> wrote:
>> 
>> Unfortunately, the “what would PostgreSQL do” approach doesn’t provide
>> guidance here.
> 
> Maybe it does.  PostgreSQL doesn't allow typeless columns, but it does
> allow columns with unspecified collating sequences, does it not?


If you do not specify a COLLATE clause in a column definition, the default 
collation of the effective character set is used.

The name ‘default’ is still a collation. Quoting 
https://www.postgresql.org/docs/current/collation.html:

        The collation of an expression can be the "default" collation, 
        which means the locale settings defined for the database. 


> What
> if you have a normal column X with some collating sequence C and then
> a generated column Y that as just "AS(X)”.

It depends on whether the generated column specifies a <data type> or not (see 
below).


>  If you do comparisons on
> column Y, which collating sequence does it use - the default or C?
> 
> Can you run that experiment for us?

David Raymond did.

Here is what happens, split into the three relevant cases.

> testing=> create table foo (
> testing(> a text,
> testing(> b text collate "C",
> testing(> c text collate "en-US-x-icu”,
> […]
> attnum | attname | attgenerated |  collname
> --------+---------+--------------+-------------
>      1 | a       |              | default
>      2 | b       |              | C
>      3 | c       |              | en-US-x-icu

Pretty obvious.

The next columns:

> testing(> d text generated always as (a) stored,
> testing(> e text generated always as (b) stored,
> testing(> f text generated always as (c) stored,

>      4 | d       | s            | default
>      5 | e       | s            | default
>      6 | f       | s            | default

The generated column definitions mention a <data type>, in that case the types 
character set’s default collation is used — coincidentally called “default”. 
Thus, it is the same case as column “a”.

Finally:

> testing(> g text collate "C" generated always as (c) stored,
> testing(> h text generated always as (c collate "C") stored,
> testing(> i text collate "C" generated always as (c collate "en-US-x-icu") 
> stored
> testing(> );

>      7 | g       | s            | C
>      8 | h       | s            | default
>      9 | i       | s            | C

Again, <data type> is specified, thus this collation is used. The collation of 
“h” is not “C” because there is an explicit type definition “text”, which 
includes a character set and a default collation.

The collation of the expression is really only relevant if there is no <data 
type> (and no COLLATE outside the expression).

> 
> Or maybe you are thinking the collating sequence of the expression in
> the AS clause should only be carried through into the generated column
> if it is explicitly stated, and not implied?

There is only one case when the type of the expression, including the character 
set and collation should be carried through into the generated column: if the 
generated column doesn’t explicitly define a type.

> 
> What happens if there is a collating sequence specified in the AS
> clause and also another collating sequence on the column definitions?
> 
>  CREATE TABLE t1(x TEXT, y TEXT AS (x COLLATE nocase) COLLATE rtree);
> 
> Which collating sequence should be used for "SELECT * FROM t1 WHERE y='xyz';"?

There are three places where a COLLATE clause in the definition of a generated 
column is allowed:

(1) After the data type (see 6.1, <predefined type> in the BNF)
(2) Inside the expression (see 6.31, <character factor> in the BNF)
(3) At the very end (see 11.4, <column definition> in the BNF).

There is a syntax rule prohibiting (1) and (3) being used at the same time 
(11.4 SR12b). The same rule says that the effect of (1) and (3) is the same:

>       • Otherwise, <collate clause> shall not be both specified in <data 
> type> and immediately contained in <column definition>. If <collate clause> 
> is immediately contained in <column definition>, then it is equivalent to 
> specifying an equivalent <collate clause> in <data type>. 

This is the case you were just are asking about.

Case (2) might affect the collation of the expression, which is only relevant 
in case there is neither (1) or (3) specified. In that case, also the collation 
is taken from the expression—via the data type (11.4 SR13c).

> 13)  The declared type of the column is
> Case:
>       • a)  If <data type> is specified, then that data type. If <generation 
> clause> is also specified, then the declared type of <generation expression> 
> shall be assignable to the declared type of the column.
>       • b)  If <domain name> is specified, […]
>       • c)  If <generation clause> is specified, then the declared type of GE.

As you see, as soon as a <data type> is specified, the collation of the 
expression is irrelevant. This is what can be demonstrated in PostgreSQL. 
Skipping <data type> is currently not possible in PostgreSQL, thus PostgreSQL 
cannot be used to demonstrate c).

> Regarding WWPD: If you can convince Tom Lane and/or Bruce Momjian to
> send me an email that says "An explicit collating sequence at the
> top-level of a GENERATED ALWAYS AS clause should be carried through as
> the default collating sequence of the generated column itself", then
> I'll change it.  :-)

I won’t because that’s not how it works.

It doesn’t matter if there is a COLLATE clause in the expression nor where it 
appears. My example was just making it explicit.

The only thing that matters is whether there is a <data type> or COLLATE clause 
outside the expression or not. If neither of them is present, the collation of 
the expression is the relevant one for the generated column. It doesn’t matter 
where the expression got its collation from.

Even if there is no COLLATE clause in the expression, the expression’s 
collation should still be taken if there is neither a <data type> nor a COLLATE 
outside the expression—like in this example:

sqlite> CREATE TABLE tmp (base_column text COLLATE NOCASE, gen_column GENERATED 
ALWAYS AS (base_column));
sqlite> INSERT INTO tmp VALUES ('a'), ('A'), ('b'), ('B');
sqlite> select * from tmp order by base_column;
a|a
A|A
b|b
B|B
sqlite> select * from tmp order by gen_column;
A|A
B|B
a|a
b|b

The expected result would be the same order for both queries (NOCASE). IMHO: 
that would also be the only reasonable expectation. Similar to a “pass through” 
column of a view:

sqlite> CREATE VIEW tmpv AS SELECT * FROM tmp;
sqlite> SELECT * FROM tmpv ORDER BY base_column;
a|a
A|A
b|b
B|B

As soon as you add a <data type> to the generated column, the expression's 
collation becomes irrelevant.

   gen_column TEXT GENERATED ALWAYS AS (base_column)

Adding “TEXT” as <data type> activates the branch a) of the above quoted SR13.

In lack of a <data type> and COLLATE outside the expression, brach c) triggers. 
This is the case I’m talking about. Whether that collation of the expression is 
implicit or explicit doesn’t matter.

-markus

> 
> -- 
> D. Richard Hipp
> d...@sqlite.org
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Generated columns and COLLATE in the AS parens

Reply via email to