Hi Mamta,
Thanks for this extensive write-up. This helps me puzzle through the
issues although I'm afraid I'm still muddled. Some comments follow inline.
Mamta Satoor wrote:
lots of good stuff ...
SQL spec also talks in various sections about default collation
associated with character repertoire, character set, SQL schema, SQL
session. These defaults are used to determine the collation
requirement for character string types in ambiguous places. (In a SQL
implementation of SQL spec's collation clause support, if the rules
for the collation determination is not what the application wants,
then the user can use the COLLATE clause to override the SQL spec
behavior.) For string literals, which has been the focus of recent
collation discussion on the Derby list, SQL spec specifies(Section 5.3
<literal> Syntax Rule 15) that it's collation will be the collation of
the character set (which is the default collation of the character
set. Every character set defined in the SQL implementation is required
to have a default collation).
In the absence of the COLLATE clause support in Derby 10.3, I do not
think we can follow the SQL spec for string literals' collation type.
If we decide to make user defined collation(through the JDBC url
attribute COLLATION) as Derby's default collation for the character
set, then that would mean that a 10.3 db with COLLATION attribute as
TERRITORY_BASED will always have string literals with TERRITORY_BASED
collation. This will break our metadata queries which does comparison
of SYS character column against string literals. The SYS character
column will have UCS_BASIC collation and string literals will have
TERRITORY_BASED collation and during the comparison, Derby will end up
throwing exception because character strings with different collation
can't be compared. If Derby 10.3 had support for COLLATE clause, then
we could implement SQL spec behavior for string literal and let
metadata queries use the COLLATE clause and users could use the
COLLATE clause in their queries against system tables when using
string literals.
I wonder if we could just implement the COLLATE clause for string
literals. Would that be a lot of work? The metadata queries could be
adjusted to use the COLLATE clause. I think we should avoid violating
the SQL standard if we can.
So, in the absence of the COLLATE clause in Derby 10.3, what I am
proposing is string literals have a collation type of UNKNOWN. When
they get used in a collation operation, these UNKNOWN collation types
will get their collation from the other operands involved in the
operation (requirement here would be that all the operands whose
collation type is not UNKNOWN, will have the same collation type
associated with them. This requirement comes from Section 9.13
Collation determination, Syntax Rule 3e). If all of the operands in
collation operation have collation type as UNKNOWN, then the collation
chosen would be the one specified through the COLLATION attribute on
the JDBC url. This last line makes Derby's collation behavior not
match SQL spec Section 9.13 Collation determination, Rule 2 which says
"At least one operand shall have a declared type collation." But
again, I think in the absence of SQL's COLLATE clause, we have to
break some rules to make Derby more flexible. If we go with my
proposal for string literal, then a comparison like 'aa' < 'ba' will
have the 2 operands with UNKNOWN collation, and Derby's collation
algorithim at collation time will choose the value specified for
COLLATION attribute in the JDBC url. For function that returns a
string datatype, we will have the collation type for that string
datatype as UNKNOWN (same as string literals).
Other than this, I think Derby's collation type of UNKNOWN has a clear
mapping with "none" value of collation derivation in SQL spec.
I'm a little lost here. Could you explain what is meant by the UNKNOWN
collation? I don't see this collation described in either the SQL
standard or in the spec attached to DERBY-1478.
more good stuff ...
7)For CURRENT_USER, SESSION_USER, SYSTEM_USER, SQL spec Section 6.4
Syntax Rule 4 says that their collation type is the collation
of character set SQL_IDENTIFIER. SQL spec Section 4.2.6 Collations
talks about SQL_IDENTIFIER's collation being implementation defined.
Based on this, we can decide the collation for these USER functions
to be UNKNOWN. Again, the argument here is same as used for string
literals.
It seems to me that SQL_IDENTIFIERs have collation UCS_BASIC. This is
how they behave when it comes to preventing two schema objects from
having the same name. And this is what forces a collation of UCS_BASIC
on the string columns in the system tables. I'm not sure I understand
bullet (7). I hope it is not saying that SQL_IDENTIFIERs have UNKNOWN
collation. I think we will get into a lot of trouble if we try to force
SQL_IDENTIFIERs to have some collation other than UCS_BASIC.
Thanks,
-Rick
- Re: Collation feature discussion Rick Hillegas
-