First, I did not use the term "SQL Server" but "RDBMS". We spoke about JDBC, which is
made to be independent of any RDBMS (SQL Server included, which is better designated
as "Microsoft SQL Server", because Oracle is also, among other things, a SQL server).
Also, UCS-2 is officially dead! You won't find any use of it now in ISO/IEC 10646
(where it was first used; it has since been replaced by "UCS-4" in some ISO/IEC 10646
documents, and by the generic term "code point" in both Unicode and newer 10646
documents).
Don't speak about UCS-2! Microsoft SQL Server does not (and will no longer) support
it. Instead, refer to the UTF-16 encoding form, to the UTF-16* encoding schemes, or
to code points in the BMP, which are less ambiguous!
Whatever you think, UCS-2 just means all the valid characters in the BMP, and it says
nothing about bits for their storage (so a UCS-2 character can be handled in a UTF-32
encoding form or stored/transmitted in one of the three UTF-32* encoding schemes).
This is just a character set made of a bijective association between some abstract
characters (with defined properties) and integers (in a limited range of values).
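To make the distinction concrete, here is a small Java sketch (using the code-point methods Java added in J2SE 5.0, after this discussion) contrasting a BMP character with a supplementary one that strict UCS-2 cannot represent:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+00E9 (e with acute) is in the BMP: one code point, one UTF-16 code unit.
        String bmp = "\u00E9";
        System.out.println(bmp.length());                        // 1 code unit
        System.out.println(bmp.codePointCount(0, bmp.length())); // 1 code point

        // U+10400 is outside the BMP: one code point, but two UTF-16 code units
        // (a surrogate pair), so it is not representable in strict UCS-2.
        String sup = new String(Character.toChars(0x10400));
        System.out.println(sup.length());                        // 2 code units
        System.out.println(sup.codePointCount(0, sup.length())); // 1 code point
    }
}
```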
It's true that up to Unicode 3.1, all characters were assigned only within UCS-2, but
the BMP has long contained code points assigned to surrogates, and the fact that
RDBMS systems could be used to store unpaired surrogates is just a legacy of a time
when such checks were relaxed (simply because surrogates were not yet needed). Some
non-conforming applications have used these surrogates as if they were PUA without
encoding restrictions. Such applications need to be upgraded to use the new PUA
allocated in planes 15 and 16, with correctly paired surrogates. This implies a
database migration if they upgrade their server software, and possibly some new disk
space to remap these illegal uses; otherwise the application and database must be
kept to accept only a past version of Unicode...
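A minimal check for the unpaired surrogates described above could look like this in Java (a sketch of the principle, not any RDBMS's actual validation code):

```java
public class SurrogateCheck {
    // Returns true if every surrogate code unit in s is part of a correctly
    // ordered high/low pair, i.e. the string is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low one.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));          // true
        System.out.println(isWellFormedUtf16("\uD801\uDC00")); // true (valid pair)
        System.out.println(isWellFormedUtf16("\uD801"));       // false (unpaired high)
        System.out.println(isWellFormedUtf16("\uDC00\uD801")); // false (reversed pair)
    }
}
```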
All RDBMS systems will now enforce the correct pairing of surrogates, even if they
continue to use UTF-16 for their storage. But from the JDBC view, this is coherent
with what Java does, because the "String" class and the native "char" type (or
"Character" class) currently only allow storing 16 bits per character. This may
change soon, notably on 64-bit platforms (look closely at the Java documentation: the
size of a Java "char" is not explicitly specified and could be larger, including in
the bytecode instructions!).
Some Java classes that assume that "char" arithmetic will automatically wrap after 16
bits are wrong. The JVM spec only requires that char be at least 16 bits wide (but it
may be larger). The compiled classes need to store string constants, but these
constants are serialized to be platform-independent using a UTF-8 encoding scheme.
Your sentence:
the "N" prefix fields *always* mean UCS-2 for MSSQLS, period.
is wrong, notably from the JDBC view, simply because we don't care about physical
types on the server, the networking protocol, or the middleware, but about JDBC
datatypes. Look at Sybase (from which MSSQLS inherited its client library, its SQL
language and syntax, and its list of supported datatypes): you'll see that Sybase
correctly handles ALL Unicode characters with NVARCHAR simply because it correctly
uses the UTF-16 encoding and checks it in all its conversion libraries.
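The kind of conversion-time check described here can be illustrated with Java's own charset machinery (a sketch of the principle, not Sybase's actual conversion code): a strict encoder, told to report rather than silently replace malformed input, refuses a string containing an unpaired surrogate.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncodeDemo {
    // Encodes s as UTF-8, reporting (rather than silently replacing)
    // any malformed UTF-16 input such as an unpaired surrogate.
    static boolean encodesCleanly(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            enc.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // invalid UTF-16 sequence detected at conversion time
        }
    }

    public static void main(String[] args) {
        System.out.println(encodesCleanly("\uD801\uDC00")); // true (valid pair)
        System.out.println(encodesCleanly("\uD801"));       // false (unpaired)
    }
}
```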
MSSQLServer will also perform these checks when converting a client encoding to
storage, because it internally uses the MultiByteToWideChar() Win32 API (and its
reverse, WideCharToMultiByte()) to perform this job. These APIs *will* report errors
that the SQL Server can use to detect invalid UTF-16 sequences, if the server is
tuned to do so, or if the client does not use UTF-16 but another encoding, or if the
client uses a UTF-8 encoding with sequences of more than 3 bytes on a server or
database that was not marked as being Unicode 3.0+ compatible. Look at the notices
for Chinese users, who most often use a codepage-9xx environment on Windows, and at
how Microsoft made its system and servers comply with the required support of the
GB18030 standard since 2000...
You should also look at the upgrade notices in MSSQLServer releases. Your assumptions
are wrong if you just consider the relaxed encoding rules permitted in previous
versions.
For migration purposes, there exist tools that will check whether an old database can
be marked to support a more recent version of Unicode, and that will help a dbadmin
diagnose the content of tables and SQL stored procs or triggers that need updates. If
all tests pass, the database will be marked with the new version and it will be
possible for clients to use a higher version of Unicode without experiencing
conversion problems or alerts...
From the JDBC point of view, these are driver-specific details. All that is needed in
JDBC is to map an RDBMS type to some known Java classes, and to be able to use
instances of these types when "discussing" with an RDBMS, handling possible
conversion errors by mapping them to Java exceptions.
So use the Java String class (which really is UTF-16 for now, but may be extended to
UTF-32 on future 64-bit platforms), and ignore the other encoding details, as this
job belongs to each JDBC driver (for now all of them build String instances using
UTF-16, but it will probably soon be possible to indicate to the JDBC driver a
preference for UTF-32 in order to remove the handling of surrogates in the Java
application).
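The point that the application sees only java.lang.String while each driver deals with the wire encoding can be sketched without any database: however the bytes travel (here UTF-8 vs. UTF-16BE, standing in for two hypothetical drivers), the decoded String the application receives is identical.

```java
import java.nio.charset.StandardCharsets;

public class DriverEncodingDemo {
    public static void main(String[] args) {
        // Text containing a BMP character and a supplementary one.
        String original = "caf\u00E9 " + new String(Character.toChars(0x10400));

        // Two hypothetical "drivers" shipping the same text with different
        // encoding schemes over the wire:
        byte[] wireUtf8    = original.getBytes(StandardCharsets.UTF_8);
        byte[] wireUtf16be = original.getBytes(StandardCharsets.UTF_16BE);

        // Each driver rebuilds a plain java.lang.String for the application:
        String fromUtf8    = new String(wireUtf8, StandardCharsets.UTF_8);
        String fromUtf16be = new String(wireUtf16be, StandardCharsets.UTF_16BE);

        System.out.println(fromUtf8.equals(original));    // true
        System.out.println(fromUtf16be.equals(original)); // true
    }
}
```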
Sun is discussing at length the best way to standardize a common API for the full
support of Unicode 3.2+, with a coherent set of methods for Strings, in a way that
would break the fewest applications (including those that made false assumptions
about the size of a native "char"; some existing String APIs will be officially
deprecated because they return a "char" instead of an int, or because they return
indexes in terms of code units and not in terms of code points).
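The char-vs-int distinction at issue here is visible in the API Java eventually shipped (in J2SE 5.0, after this message was written): charAt returns a 16-bit char, a single code unit, while codePointAt returns an int wide enough for any Unicode scalar value.

```java
public class CodeUnitVsCodePoint {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400)); // U+10400, outside the BMP

        // charAt indexes code units: it can only return half of the pair.
        char unit = s.charAt(0);
        System.out.println(Integer.toHexString(unit)); // d801 (high surrogate)

        // codePointAt returns the full scalar value as an int.
        int cp = s.codePointAt(0);
        System.out.println(Integer.toHexString(cp));   // 10400
    }
}
```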
The probable official full support of Unicode 4 and 3.2 will come with new classes
derived from Character and String (UChar and UString are their names in the IBM ICU
package, but Sun may also keep the class names and place them in the java.text
package instead of the core's java.lang package, and a compiler option (such as the
target Java version) may allow a class author to compile code against the default
java.lang.String or the java.text.String class if the package name is not specified
by an explicit import).
-- Philippe.
----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Philippe Verdy" <[EMAIL PROTECTED]>
Sent: Wednesday, June 04, 2003 4:36 PM
Subject: Re: Encoding converion through JDBC
> From: "Philippe Verdy" <[EMAIL PROTECTED]>
>
> Philippe, you went on for quite a while and I admit most of the things you
> talked about are not things about which I have knowledge. But some of the
> things you talked about, I do understand, and in those cases you were wrong.
> Psychologically, it causes me to wonder how much of the rest of this message
> conveys accurate information.
>
> Specifically, you talk about SQL Server but most of what you said about it
> is inaccurate. You cannot store big endian data without risking corruption,
> you can only store UCS-2, it is not surrogate aware and can thus be said to
> truly support only UCS-2, not UTF-16, and the "N" prefix fields *always*
> mean UCS-2 for MSSQLS, period.
>
> You have a gift -- that of being able to speak knowledgably. But please, use
> that gift for *good* and do not move past what you know.
>
> Please, think about it?
>
> MichKa
>