Hi Mamta,

Thanks for this extensive write-up. This helps me puzzle through the issues although I'm afraid I'm still muddled. Some comments follow inline.

Mamta Satoor wrote:
lots of good stuff ...
SQL spec also talks in various sections about default collation associated with character repertoire, character set, SQL schema, SQL session. These defaults are used to determine the collation requirement for character string types in ambiguous places. (In a SQL implementation of SQL spec's collation clause support, if the rules for the collation determination is not what the application wants, then the user can use the COLLATE clause to override the SQL spec behavior.) For string literals, which has been the focus of recent collation discussion on the Derby list, SQL spec specifies(Section 5.3 <literal> Syntax Rule 15) that it's collation will be the collation of the character set (which is the default collation of the character set. Every character set defined in the SQL implementation is required to have a default collation). In the absence of the COLLATE clause support in Derby 10.3, I do not think we can follow the SQL spec for string literals' collation type. If we decide to make user defined collation(through the JDBC url attribute COLLATION) as Derby's default collation for the character set, then that would mean that a 10.3 db with COLLATION attribute as TERRITORY_BASED will always have string literals with TERRITORY_BASED collation. This will break our metadata queries which does comparison of SYS character column against string literals. The SYS character column will have UCS_BASIC collation and string literals will have TERRITORY_BASED collation and during the comparison, Derby will end up throwing exception because character strings with different collation can't be compared. If Derby 10.3 had support for COLLATE clause, then we could implement SQL spec behavior for string literal and let metadata queries use the COLLATE clause and users could use the COLLATE clause in their queries against system tables when using string literals.
I wonder if we could just implement the COLLATE clause for string literals. Would that be a lot of work? The metadata queries could be adjusted to use the COLLATE clause. I think we should avoid violating the SQL standard if we can.
So, in the absence of the COLLATE clause in Derby 10.3, what I am proposing is string literals have a collation type of UNKNOWN. When they get used in a collation operation, these UNKNOWN collation types will get their collation from the other operands involved in the operation (requirement here would be that all the operands whose collation type is not UNKNOWN, will have the same collation type associated with them. This requirement comes from Section 9.13 Collation determination, Syntax Rule 3e). If all of the operands in collation operation have collation type as UNKNOWN, then the collation chosen would be the one specified through the COLLATION attribute on the JDBC url. This last line makes Derby's collation behavior not match SQL spec Section 9.13 Collation determination, Rule 2 which says "At least one operand shall have a declared type collation." But again, I think in the absence of SQL's COLLATE clause, we have to break some rules to make Derby more flexible. If we go with my proposal for string literal, then a comparison like 'aa' < 'ba' will have the 2 operands with UNKNOWN collation, and Derby's collation algorithim at collation time will choose the value specified for COLLATION attribute in the JDBC url. For function that returns a string datatype, we will have the collation type for that string datatype as UNKNOWN (same as string literals). Other than this, I think Derby's collation type of UNKNOWN has a clear mapping with "none" value of collation derivation in SQL spec.
I'm a little lost here. Could you explain what is meant by the UNKNOWN collation? I don't see this collation described in either the SQL standard or in the spec attached to DERBY-1478.
more good stuff ...

7)For CURRENT_USER, SESSION_USER, SYSTEM_USER, SQL spec Section 6.4 Syntax Rule 4 says that their collation type is the collation of character set SQL_IDENTIFIER. SQL spec Section 4.2.6 Collations talks about SQL_IDENTIFIER's collation being implementation defined. Based on this, we can decide the collation for these USER functions to be UNKNOWN. Again, the argument here is same as used for string literals.
It seems to me that SQL_IDENTIFIERs have collation UCS_BASIC. This is how they behave when it comes to preventing two schema objects from having the same name. And this is what forces a collation of UCS_BASIC on the string columns in the system tables. I'm not sure I understand bullet (7). I hope it is not saying that SQL_IDENTIFIERs have UNKNOWN collation. I think we will get into a lot of trouble if we try to force SQL_IDENTIFIERs to have some collation other than UCS_BASIC.

Thanks,
-Rick

Reply via email to