Re: Collation feature discussion

Mamta Satoor Sun, 25 Mar 2007 23:43:45 -0800

I went through the SQL spec more and here is what I have gathered along with
a proposal for Derby's collation type.


First of all, like Rick mentioned, SQL spec talks about character string
types having collation type and collation derivation associated with them
(SQL spec Section 4.2.2 Comparison of character strings). If collation
derivation says explicit or implicit, then it means that there is a valid
collation type associated with the charcter string type. If the collation
derivation is none, then it means that collation type can't be established
for the character string type.
1)Collation derivation will be explicit if COLLATE clause has been used for
character string type (this is not a possibility for Derby 10.3, because we
are not planning to support SQL COLLATE clause in this release).
2)Collation derivation will be implicit if the collation can be determined
from the context. eg CREATE TABLE t1(c11 char(4)) then c11 will have
collation of user requested collation ie TERRITORY_BASED/UCS_BASIC. Another
eg, TRIM(c11) then the result character string of TRIM operation will have
collation of the operand, c11.
3)Collation derivation will be none if the aggregate methods are dealing
with character strings with different collations (Section 9.3 Data types of
results of aggregations Syntax Rule 3aii).

SQL spec also talks in various sections about default collation associated
with character repertoire, character set, SQL schema, SQL
session. These defaults are used to determine the collation requirement for
character string types in ambiguous places. (In a SQL implementation of SQL
spec's collation clause support, if the rules for the collation
determination is not what the application wants, then the user can use the
COLLATE clause to override the SQL spec behavior.) For string literals,
which has been the focus of recent collation discussion on the Derby list,
SQL spec specifies(Section 5.3 <literal> Syntax Rule 15) that it's collation
will be the collation of the character set (which is the default collation
of the character set. Every character set defined in the SQL implementation
is required to have a default collation).

In the absence of the COLLATE clause support in Derby 10.3, I do not think
we can follow the SQL spec for string literals' collation type. If we decide
to make user defined collation(through the JDBC url attribute COLLATION) as
Derby's default collation for the character set, then that would mean that a
10.3 db with COLLATION attribute as TERRITORY_BASED will always have string
literals with TERRITORY_BASED collation. This will break our metadata
queries which does comparison of SYS character column against string
literals. The SYS character column will have UCS_BASIC collation and string
literals will have TERRITORY_BASED collation and during the comparison,
Derby will end up throwing exception because character strings with
different collation can't be compared. If Derby 10.3 had support for COLLATE
clause, then we could implement SQL spec behavior for string literal and let
metadata queries use the COLLATE clause and users could use the COLLATE
clause in their queries against system tables when using string literals.

So, in the absence of the COLLATE clause in Derby 10.3, what I am proposing
is string literals have a collation type of UNKNOWN. When they get used in a
collation operation, these UNKNOWN collation types will get their collation
from the other operands involved in the operation (requirement here would be
that all the operands whose collation type is not UNKNOWN, will have the
same collation type associated with them. This requirement comes from
Section 9.13 Collation determination, Syntax Rule 3e). If all of the
operands in collation operation have collation type as UNKNOWN, then the
collation chosen would be the one specified through the COLLATION attribute
on the JDBC url. This last line makes Derby's collation behavior not
match SQL spec Section 9.13 Collation determination, Rule 2 which says "At
least one operand shall have a declared type collation." But again, I think
in the absence of SQL's COLLATE clause, we have to break some rules to make
Derby more flexible. If we go with my proposal for string literal, then a
comparison like 'aa' < 'ba' will have the 2 operands with UNKNOWN collation,
and Derby's collation algorithim at collation time will choose the value
specified for COLLATION attribute in the JDBC url. For function that returns
a string datatype, we will have the collation type for that string datatype
as UNKNOWN (same as string literals).

Other than this, I think Derby's collation type of UNKNOWN has a clear
mapping with "none" value of collation derivation in SQL spec.

Following covers collation type association for various operations involving
character string types in Derby 10.3
1)String literal - collation type will be UNKNOWN.
2)<character string type> (SQL spec section 6.1 <data type> Syntax Rule 3) -
collation type will be the one specified through the COLLATION attribute on
the JDBC url
3)<cast specification> If the data type being CASTed and the result data
type are both character string type, then collation type of result character
string type will be same as the collation type of CAST operand(Section
6.12<cast specification> Syntax Rules 9,10). If the data type being
CASTed is
not character string type but the result data type is, then the collation of
the result character string type will be UNKNOWN (Section 6.12 <cast
specification> Syntax Rule 10 says that it should be collation of the
character set. I think we shouldn't implement this SQL spec behavior because
in the absence of support for COLLATE clause, we probably do not want to get
forced into a collation type which will not work in all the cases. This is
the same argument that I used earlier for having collation type of UNKNOWN
for string literals).
4)CHAR, VARCHAR functions - same as <cast specification> specified above, ie
if the first argument to CHAR/VARCHAR function is a character string type,
then the result character string of CHAR/VARCHAR will have the same
collation as the first argument. If the first argument is not a character
string type, then the collation of the result character string type will be
UNKNOWN.
5)Aggregate operators involving all character string type
operands(Concatenation, CASE, NULLIF, COALESCE) will follow SQL spec Section
9.3 Data types of results of aggregations. In other words, if all the
operands have the same collation associated with them, then the collation of
result character string type will be same. But if operands of different
collation types are involved, then the result character string type will
have collation type of UNKNOWN.
6)Result character string types from UPPER, LOWER, TRIM(LTRIM, RTRIM),
SUBSTR will have the same collation as their operand. This comes from SQL
spec Section 6.29 <string value function> Syntax Rules 8, 8, 11d, 4
respectively)
7)For CURRENT_USER, SESSION_USER, SYSTEM_USER, SQL spec Section 6.4 Syntax
Rule 4 says that their collation type is the collation of character
set SQL_IDENTIFIER. SQL spec Section 4.2.6 Collations talks about
SQL_IDENTIFIER's collation being implementation defined. Based on this, we
can decide the collation for these USER functions to be UNKNOWN. Again, the
argument here is same as used for string literals.
8)CURRENT SCHEMA and CURRENT SQLID should return character string type with
collation of UNKNOWN.

I think that covers all different uses of character string type in Derby.

As always, any feedback will be very helpful,
Mamta

On 3/23/07, Mamta Satoor <[EMAIL PROTECTED]> wrote:


Thanks a bunch for the pointers, Rick.

What is intriguing is the default collation of the string data type. For a
string literal, if it is used in a comparison operation with SYS schema
character column, then logically, the string literal should have collationof 
UCS_BASIC. But if the string literal is used in comparison with user
schema character column, then it's collation should be whatever is defined
for that user schema. So, it sounds like the default collation of
character set will be different depending on the schema. SQL spec does say
that there is a collation descriptor associated with schema descriptor.
And may be that is how character set's default collation will be
determined. I need to spend more time on SQL spec to understand this
completely.

Mamta


 On 3/23/07, Rick Hillegas <[EMAIL PROTECTED] > wrote:
>
> Hi Mamta,
>
> This is my understanding of what these words mean, based on a quick
> googling of industry practices. For instance, see
>
> http://www.nocomsoftware.se/p5745/files/whatsnew-sb-10.0.0.htm
> http://msdn2.microsoft.com/en-us/library/ms179886.aspx
>
> Explicit - This means that a COLLATE clause in the statement forces the
> server to use a particular collation.
>
> Implicit - This means that your statement mentions a column without
> using a COLLATE clause. The column itself has a collation which was
> determined when the table was created.
>
> None - This case arises when you use SQL operators to combine two
> columns which have different collations. For example "select
> frenchColumn || englishColumn from ...". In this case the server cannot
> figure out which collation to use.
>
> There is also a concept of a default collation for a datatype. As I read
> the SQL standard, part 2, section 4.2.2, I see the following:
>
> 1) A string datatype has a default character set associated with it.
>
> 2) That character set, in turn, has a distinguished collation associated
> with it.
>
> 3) That collation is the default collation of the string datatype. That
> is, if you create a column of that datatype and you don't include a
> COLLATE clause, then the column has that collation. Similarly, if you
> declare a function that returns a string datatype and you don't include
> a COLLATE clause, then the function returns a string having that default
>
> collation.
>
> Hope this helps...
>
> Regards,
> -Rick
>
>
>
> Mamta Satoor wrote:
> > I am looking at the SQL spec to see how it deals with the problem of
> > different collation types, which they call as explicit, implicit and
> > none. Hopefully, that will make it easier to come up with a logic for
> > deducting correct collation type for non-trivial cases like COLLATE,
> > TRIM, string literal, etc.
> >
> > Mamta
> >
> >
> > On 3/22/07, *Daniel John Debrunner* <[EMAIL PROTECTED]
> > <mailto: [EMAIL PROTECTED]>> wrote:
> >
> >     Mamta Satoor wrote:
> >     > Before talking about functions, I think it will be better to
> >     first talk
> >     > about string literals and their collation determination.
> >     >
> >     > SQL spec section 5.3 <literal>, Syntax Rule 15) says "The
> >     declared type
> >     > collation of a <character string literal> is the character set
> >     > collation, and the collation derivation is implicit."
> >     >
> >     > Based on this, when a string literal (collation type UNKNOWN) is
> >     getting
> >     > used in a collation method with another operand as UCS_BASIC
> >     collation,
> >     > then the collation type of string literal will be UCS_BASIC.
> >     Similar
> >     > rule for operand with TERRITORY_BASED. In a case where,
> >     collation types
> >     > of all the operands is UNKNOWN, at collation time, it can be
> >     assumed to
> >     > be whatever is defined for user defined character columns. This
> >     will be
> >     > similar to the example given by Rick for implicit collation type
> >     when
> >     > talking about CAST ie
> >     > CREATE TABLE t1 (c11 char(1) default 'a') In this example, the
> >     collation
> >     > type of DTD associated with 'a' will be implicitly whatever is
> >     defined
> >     > at the database level for COLLATION.
> >     >
> >     > Hope this answers the question about string literals.
> >
> >     Kind of, I looked up the definition of "collation derivation is
> >     implicit" in section 4.2.2 of the standard and at first reading it
> >     wasn't obvious to me what it meant.
> >
> >     I know I suggested the 'collation type UNKNOWN' but I hadn't
> >     looked into
> >     the SQL standard in detail, and now I'm wondering if the UNKNOWN
> >     concept is a good idea. Since the SQL standard already defines a
> model
> >     for how collations are defined it might be wise to follow the
> required
> >     model and naming. Not sure what that would mean exactly, but it
> seems
> >     like each character expression can have a derivation of explicit,
> >     implicit or none. These may be better ways to carry state rather
> than
> >     unknown. Unless of course there's a clear mapping between unknown
> and
> >     the sql standard definition.
> >
> >     Dan.
> >
> >
>
>

Re: Collation feature discussion

Reply via email to