Re: [sqlite] Generated columns and COLLATE in the AS parens

Markus Winand Fri, 31 Jan 2020 02:30:36 -0800

Let me first explain how collations work in standard SQL. I think my answers 
below make more sense then.

In the SQL standard….

- character string **types** have properties such as fix-length vs. variable 
length, the length (limit), character set and also the collation.

- columns have types, which include the collation if it is a character string 
type.
  However, it is not the column that has a collation. The column has a type, 
which might have a collation.

- values have a static type (the so-called “declared type”). Static means it is 
determined at “compile time” from the syntax and the data dictionary.

- Expressions also have a declared type. More elaborate: the result value of an 
expression has a type that is statically determined (again  the “declared 
type”).

- expressions of which the declared type is a character string type, will also 
have all the properties of character string types (including the collation).

- When character strings are compared, the effective collation is determined 
from the declared types of the operands.
   The SQL standard defines rules how to do that (e.g. in 9075-2 9.15 
"Collation determination”)
   The most important rule is the "collation derivation” order: explicit, 
implicit, none. That’s the rule that says
   the COLLATE clause on expressions (“explicit” derivation) is stronger than 
the COLLATE clause following type names (“implicit” derivation).

But note that the last statement is not the reason the collation of the 
generation expression takes precedence of the data type — IT DOESN’T!

As far as I know, this is basically unchanged sind SQL-92. If you do not have 
access to the current release of the standard (of 2016, to which my references 
apply), you can also have a look at SQL-92 here (search for "4.2  Character 
strings”):

https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

Generated columns where introduced with SQL:2003 and are thus not in the linked 
document.

Putting these mechanics of working with collations in context of generated 
columns:

- Generated columns need a type like any other column.
  If that type happens to be a character string type, the type includes the 
collation as part of the types properties.

- Even though generated columns need a type, they allow skipping the type in 
the definition of the generated column—that’s very unique to generated columns.
  E.g. specifying a DEFAULT clause does not lift the requirement to explicitly 
state the type of the column.

- If a generated column doesn’t have an explicit data type mentioned, the the 
data type of the result of the expression is used.
  Whatever data type that is, whatever properties that type has.
  If it happens to be a character string type, it will also have a “collation” 
property.

However, the declared type of the expression (including its properties) is only 
relevant if the column definition doesn’t explicitly state a data type.

These are two different cases in standard SQL:

  C1 TEXT GENERATED ALWAYS AS (…)
  C2      GENERATED ALWAYS AS (…)

The type of C1 is TEXT, including all its default properties.
The type of C2 is the type of the result of the expression, including all its 
properties.

The “what’s new” paper for SQL:2003 mentions that case explicitly.

http://sigmodrecord.org/publications/sigmodRecord/0403/E.JimAndrew-standard.pdf

The example on the last page:

> CREATE TABLE EMPLOYEES (
>  EMP_ID INTEGER,
>  SALARY DECIMAL(7,2),
>  BONUS DECIMAL(7,2),
>  TOTAL_COMP GENERATED ALWAYS AS (SALARY + BONUS)
>  )
> 
> TOTAL_COMP is a generated column of the EMPLOYEES table. The data type of the 
> TOTAL_COMP is the data type of the expression (SALARY_BONUS).

I think that’s a typo and should read (SALARY + BONUS).

> Users may optionally specify a data type for a generated column, in which 
> case the specified data type must match with the data type of the associated 
> expression.

The last part is not 100% in line with the current standard. The types don’t 
need to match, they need to be assignable. I don’t know if this is just a 
simplification for this paper or if SQL:2003 really hat that requirement.

Further comments to your statements below…

> On 30 Jan 2020, at 23:06, Keith Medcalf <kmedc...@dessus.com> wrote:
> 
> 
> On: Wednesday, 29 January, 2020 06:45, Markus Winand 
> <markus.win...@winand.at> wrote:
> 
>> I think there might be a glitch in the way SQLite 3.31.x derives the
>> collation information from the expression of a generated column.
> 
>> In particular, COLLATE inside the AS parens seems to be ignored, but it
>> is honoured after the parens:
> 
> Carrying the COLLATE from an expression into the column definition is 
> incorrect.  The definition of a generated column is:
> 
> <identifier> [type affinity] [GENERATED ALWAYS AS (<expression>)] [COLLATE 
> <identifier>] [<constraints> ...]
> 
> so why would the so including a COLLATE as part of the expression applies to 
> the expression, and not to the column.  Except for the ordering of 
> <identifier> which must be first, followed by the type which must come 
> second, the other bits can appear in pretty much any ordering you heart 
> desires (it is a Dim Sum of clauses).  
> 
> If the <type affinity> of the column not specified then its "type affinity" 
> is BLOB (None).

This is where the SQL standard is different: if there is no type, the column 
gets the type of the expression (ISO/IEC 9075-2, 11.4 SR13c)

> If no COLLATE is specified for the column, then the default collation BINARY 
> applies.

Again, the SQL standard is different: If there is no COLLATE clause **and** not 
data type specified, the collation is inherited via the data type of the result 
of the expression.

If the column definition explicitly states a data type, that type comes with a 
collation, which takes precedence of the type of the expression.

> If no NULL constraint is specified then NULL values are permitted.

If not NOT NULL constraint….

> If no DEFAULT expression is specified then the default value is NULL.
> 
> Why would you think that the COLLATE applied to the evaluation of an 
> expression would "carry out" to the column itself? 

Because this is what is written in the SQL standard. And it make sense when 
collations are a property of character string types, which they are in the SQL 
standard.

> Especially in light of the fact that the type/collation of the expression in 
> a "DEFAULT (<expression>)" does not carry out to the column definition?

Because you cannot use DEFAULT without specifying a type name at the very same 
time (in standard SQL).

11.4, SR 11) If <generation clause> is omitted, then either <data type> or 
<domain name> shall be specified. 

If it would be possible to skip the data type when a DEFAULT clause is present, 
I guess the same rules as for generated columns would apply (take the type, 
including all its properties) of the expressions result.

> 
> The collation sequence applies to "comparison" operations of TEXT values, and 
> not to anything else (it is ignored for all 
> other value types).  It is an attribute of a column (like column affinity) 
> and not an attribute of a value.

It is an attribute of character string **types**.

Character string **columns** have collations because their type has one.

Values have a collation because their type has one.

> 
> In other words,
> 
> x AS (y == 'yahoo' collate nocase)
> 
> applies the affinity nocase to the evaluation of the == comparison operation. 
>  The result of evaluation of the expression is a data value with a concrete 
> type (integer) and no associated collation.  This value then has whatever 
> column affinity is specified applied to it and inherits the collation of the 
> containing column.

But the case was bringing up is that the generated column doesn’t have a type 
declared.

> 
> x AS (y collate nocase)
> 
> is the same thing as
> 
> x AS (y)
> 
> since there is no "comparison" operation in the expression to which the 
> collation can be applied (it is a useless attribute to the expression).

Assuming Y is a character string and the collation of Y is not NOCASE, the type 
of both expressions is different (but not the value). As the type contains the 
collation, it is not necessarily useless.

>  The result of the expression is a value of the concrete type of y with no 
> associated collation.

The result of the expression “(Y)” has the very same declared type as Y itself. 
If Y is a character string, the result of the expression has the same collation 
as Y itself.

>  This value then has the whatever column affinity is specified applied to it 
> and inherits the collation of the containing column.

If the column has a defined data type or an explicit COLLATE outside the 
expression, these collations will be used. Otherwise: the collation of the 
declared type of the expression is used because the entire data type of the 
result of the expression is used.

> 
> ie, "values" do not have a collation associated with them, column definitions 
> and comparison operators on text have collation sequences associated with 
> them.

Values have a declared types, character string types have collations.

Nobody cares about the collation of a character string until it takes part in a 
comparison.

-markus

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Generated columns and COLLATE in the AS parens

Reply via email to