I went through the SQL spec more and here is what I have gathered along with a proposal for Derby's collation type.
First of all, like Rick mentioned, SQL spec talks about character string types having collation type and collation derivation associated with them (SQL spec Section 4.2.2 Comparison of character strings). If collation derivation says explicit or implicit, then it means that there is a valid collation type associated with the charcter string type. If the collation derivation is none, then it means that collation type can't be established for the character string type. 1)Collation derivation will be explicit if COLLATE clause has been used for character string type (this is not a possibility for Derby 10.3, because we are not planning to support SQL COLLATE clause in this release). 2)Collation derivation will be implicit if the collation can be determined from the context. eg CREATE TABLE t1(c11 char(4)) then c11 will have collation of user requested collation ie TERRITORY_BASED/UCS_BASIC. Another eg, TRIM(c11) then the result character string of TRIM operation will have collation of the operand, c11. 3)Collation derivation will be none if the aggregate methods are dealing with character strings with different collations (Section 9.3 Data types of results of aggregations Syntax Rule 3aii). SQL spec also talks in various sections about default collation associated with character repertoire, character set, SQL schema, SQL session. These defaults are used to determine the collation requirement for character string types in ambiguous places. (In a SQL implementation of SQL spec's collation clause support, if the rules for the collation determination is not what the application wants, then the user can use the COLLATE clause to override the SQL spec behavior.) For string literals, which has been the focus of recent collation discussion on the Derby list, SQL spec specifies(Section 5.3 <literal> Syntax Rule 15) that it's collation will be the collation of the character set (which is the default collation of the character set. Every character set defined in the SQL implementation is required to have a default collation). In the absence of the COLLATE clause support in Derby 10.3, I do not think we can follow the SQL spec for string literals' collation type. If we decide to make user defined collation(through the JDBC url attribute COLLATION) as Derby's default collation for the character set, then that would mean that a 10.3 db with COLLATION attribute as TERRITORY_BASED will always have string literals with TERRITORY_BASED collation. This will break our metadata queries which does comparison of SYS character column against string literals. The SYS character column will have UCS_BASIC collation and string literals will have TERRITORY_BASED collation and during the comparison, Derby will end up throwing exception because character strings with different collation can't be compared. If Derby 10.3 had support for COLLATE clause, then we could implement SQL spec behavior for string literal and let metadata queries use the COLLATE clause and users could use the COLLATE clause in their queries against system tables when using string literals. So, in the absence of the COLLATE clause in Derby 10.3, what I am proposing is string literals have a collation type of UNKNOWN. When they get used in a collation operation, these UNKNOWN collation types will get their collation from the other operands involved in the operation (requirement here would be that all the operands whose collation type is not UNKNOWN, will have the same collation type associated with them. This requirement comes from Section 9.13 Collation determination, Syntax Rule 3e). If all of the operands in collation operation have collation type as UNKNOWN, then the collation chosen would be the one specified through the COLLATION attribute on the JDBC url. This last line makes Derby's collation behavior not match SQL spec Section 9.13 Collation determination, Rule 2 which says "At least one operand shall have a declared type collation." But again, I think in the absence of SQL's COLLATE clause, we have to break some rules to make Derby more flexible. If we go with my proposal for string literal, then a comparison like 'aa' < 'ba' will have the 2 operands with UNKNOWN collation, and Derby's collation algorithim at collation time will choose the value specified for COLLATION attribute in the JDBC url. For function that returns a string datatype, we will have the collation type for that string datatype as UNKNOWN (same as string literals). Other than this, I think Derby's collation type of UNKNOWN has a clear mapping with "none" value of collation derivation in SQL spec. Following covers collation type association for various operations involving character string types in Derby 10.3 1)String literal - collation type will be UNKNOWN. 2)<character string type> (SQL spec section 6.1 <data type> Syntax Rule 3) - collation type will be the one specified through the COLLATION attribute on the JDBC url 3)<cast specification> If the data type being CASTed and the result data type are both character string type, then collation type of result character string type will be same as the collation type of CAST operand(Section 6.12<cast specification> Syntax Rules 9,10). If the data type being CASTed is not character string type but the result data type is, then the collation of the result character string type will be UNKNOWN (Section 6.12 <cast specification> Syntax Rule 10 says that it should be collation of the character set. I think we shouldn't implement this SQL spec behavior because in the absence of support for COLLATE clause, we probably do not want to get forced into a collation type which will not work in all the cases. This is the same argument that I used earlier for having collation type of UNKNOWN for string literals). 4)CHAR, VARCHAR functions - same as <cast specification> specified above, ie if the first argument to CHAR/VARCHAR function is a character string type, then the result character string of CHAR/VARCHAR will have the same collation as the first argument. If the first argument is not a character string type, then the collation of the result character string type will be UNKNOWN. 5)Aggregate operators involving all character string type operands(Concatenation, CASE, NULLIF, COALESCE) will follow SQL spec Section 9.3 Data types of results of aggregations. In other words, if all the operands have the same collation associated with them, then the collation of result character string type will be same. But if operands of different collation types are involved, then the result character string type will have collation type of UNKNOWN. 6)Result character string types from UPPER, LOWER, TRIM(LTRIM, RTRIM), SUBSTR will have the same collation as their operand. This comes from SQL spec Section 6.29 <string value function> Syntax Rules 8, 8, 11d, 4 respectively) 7)For CURRENT_USER, SESSION_USER, SYSTEM_USER, SQL spec Section 6.4 Syntax Rule 4 says that their collation type is the collation of character set SQL_IDENTIFIER. SQL spec Section 4.2.6 Collations talks about SQL_IDENTIFIER's collation being implementation defined. Based on this, we can decide the collation for these USER functions to be UNKNOWN. Again, the argument here is same as used for string literals. 8)CURRENT SCHEMA and CURRENT SQLID should return character string type with collation of UNKNOWN. I think that covers all different uses of character string type in Derby. As always, any feedback will be very helpful, Mamta On 3/23/07, Mamta Satoor <[EMAIL PROTECTED]> wrote:
Thanks a bunch for the pointers, Rick. What is intriguing is the default collation of the string data type. For a string literal, if it is used in a comparison operation with SYS schema character column, then logically, the string literal should have collationof UCS_BASIC. But if the string literal is used in comparison with user schema character column, then it's collation should be whatever is defined for that user schema. So, it sounds like the default collation of character set will be different depending on the schema. SQL spec does say that there is a collation descriptor associated with schema descriptor. And may be that is how character set's default collation will be determined. I need to spend more time on SQL spec to understand this completely. Mamta On 3/23/07, Rick Hillegas <[EMAIL PROTECTED] > wrote: > > Hi Mamta, > > This is my understanding of what these words mean, based on a quick > googling of industry practices. For instance, see > > http://www.nocomsoftware.se/p5745/files/whatsnew-sb-10.0.0.htm > http://msdn2.microsoft.com/en-us/library/ms179886.aspx > > Explicit - This means that a COLLATE clause in the statement forces the > server to use a particular collation. > > Implicit - This means that your statement mentions a column without > using a COLLATE clause. The column itself has a collation which was > determined when the table was created. > > None - This case arises when you use SQL operators to combine two > columns which have different collations. For example "select > frenchColumn || englishColumn from ...". In this case the server cannot > figure out which collation to use. > > There is also a concept of a default collation for a datatype. As I read > the SQL standard, part 2, section 4.2.2, I see the following: > > 1) A string datatype has a default character set associated with it. > > 2) That character set, in turn, has a distinguished collation associated > with it. > > 3) That collation is the default collation of the string datatype. That > is, if you create a column of that datatype and you don't include a > COLLATE clause, then the column has that collation. Similarly, if you > declare a function that returns a string datatype and you don't include > a COLLATE clause, then the function returns a string having that default > > collation. > > Hope this helps... > > Regards, > -Rick > > > > Mamta Satoor wrote: > > I am looking at the SQL spec to see how it deals with the problem of > > different collation types, which they call as explicit, implicit and > > none. Hopefully, that will make it easier to come up with a logic for > > deducting correct collation type for non-trivial cases like COLLATE, > > TRIM, string literal, etc. > > > > Mamta > > > > > > On 3/22/07, *Daniel John Debrunner* <[EMAIL PROTECTED] > > <mailto: [EMAIL PROTECTED]>> wrote: > > > > Mamta Satoor wrote: > > > Before talking about functions, I think it will be better to > > first talk > > > about string literals and their collation determination. > > > > > > SQL spec section 5.3 <literal>, Syntax Rule 15) says "The > > declared type > > > collation of a <character string literal> is the character set > > > collation, and the collation derivation is implicit." > > > > > > Based on this, when a string literal (collation type UNKNOWN) is > > getting > > > used in a collation method with another operand as UCS_BASIC > > collation, > > > then the collation type of string literal will be UCS_BASIC. > > Similar > > > rule for operand with TERRITORY_BASED. In a case where, > > collation types > > > of all the operands is UNKNOWN, at collation time, it can be > > assumed to > > > be whatever is defined for user defined character columns. This > > will be > > > similar to the example given by Rick for implicit collation type > > when > > > talking about CAST ie > > > CREATE TABLE t1 (c11 char(1) default 'a') In this example, the > > collation > > > type of DTD associated with 'a' will be implicitly whatever is > > defined > > > at the database level for COLLATION. > > > > > > Hope this answers the question about string literals. > > > > Kind of, I looked up the definition of "collation derivation is > > implicit" in section 4.2.2 of the standard and at first reading it > > wasn't obvious to me what it meant. > > > > I know I suggested the 'collation type UNKNOWN' but I hadn't > > looked into > > the SQL standard in detail, and now I'm wondering if the UNKNOWN > > concept is a good idea. Since the SQL standard already defines a > model > > for how collations are defined it might be wise to follow the > required > > model and naming. Not sure what that would mean exactly, but it > seems > > like each character expression can have a derivation of explicit, > > implicit or none. These may be better ways to carry state rather > than > > unknown. Unless of course there's a clear mapping between unknown > and > > the sql standard definition. > > > > Dan. > > > > > >
