First, I did not use the term "SQL Server" but "RDBMS". We spoke about JDBC, which is
made to be independent of any RDBMS (SQL Server included, which is better designated
as "Microsoft SQL Server", because Oracle is also, among other things, a SQL server).
Also, UCS-2 is officially dead! You won't find any use of it now in ISO/IEC 10646
(where it was first used; it has since been replaced by "UCS-4" in some ISO/IEC 10646
documents, and by the generic term "code point" in both Unicode and newer 10646
documents).
Don't speak about UCS-2! Microsoft SQL Server does not (and will no longer) support
it. Instead, refer to the UTF-16 encoding form, to the UTF-16* encoding schemes, or
to code points in the BMP, which are less ambiguous!
Whatever you think, UCS-2 just means all the valid characters in the BMP, and it says
nothing about bits for their storage (so a UCS-2 character can be handled in a UTF-32
encoding form or stored/transmitted in one of the three UTF-32* encoding schemes).
This is just a character set made of a bijective association between some abstract
characters (with defined properties) and integers (in a limited range of values).
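To make the distinction concrete, here is a small Java sketch (using the code-point methods Java added in J2SE 5.0, after this discussion) contrasting a BMP character with a supplementary one that strict UCS-2 cannot represent:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+00E9 (e with acute) is in the BMP: one code point, one UTF-16 code unit.
        String bmp = "\u00E9";
        System.out.println(bmp.length());                        // 1 code unit
        System.out.println(bmp.codePointCount(0, bmp.length())); // 1 code point

        // U+10400 is outside the BMP: one code point, but two UTF-16 code units
        // (a surrogate pair), so it is not representable in strict UCS-2.
        String sup = new String(Character.toChars(0x10400));
        System.out.println(sup.length());                        // 2 code units
        System.out.println(sup.codePointCount(0, sup.length())); // 1 code point
    }
}
```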
It's true that up to Unicode 3.1, all characters were assigned only within UCS-2, but
the BMP has long contained code points assigned to surrogates, and the fact that
RDBMS systems could be used to store unpaired surrogates is just a legacy of a time
when such checks were relaxed (simply because surrogates were not yet needed). Some
non-conforming applications have used these surrogates as if they were PUA without
encoding restrictions. Such applications need to be upgraded to use the new PUA
allocated in planes 15 and 16, with correctly paired surrogates. This implies a
database migration if they upgrade their server software, and possibly some new disk
space to remap these illegal uses; otherwise the application and database must be
kept to accept only a past version of Unicode...
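A minimal check for the unpaired surrogates described above could look like this in Java (a sketch of the principle, not any RDBMS's actual validation code):

```java
public class SurrogateCheck {
    // Returns true if every surrogate code unit in s is part of a correctly
    // ordered high/low pair, i.e. the string is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low one.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));          // true
        System.out.println(isWellFormedUtf16("\uD801\uDC00")); // true (valid pair)
        System.out.println(isWellFormedUtf16("\uD801"));       // false (unpaired high)
        System.out.println(isWellFormedUtf16("\uDC00\uD801")); // false (reversed pair)
    }
}
```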
All RDBMS systems will now enforce the correct pairing of surrogates, even if they
continue to use UTF-16 for their storage. But from the JDBC view, this is coherent
with what Java does, because the "String" class and the native "char" type (or
"Character" class) currently only allow storing 16 bits per character. This may
change soon, notably on 64-bit platforms (look closely at the Java documentation: the
size of a Java "char" is not explicitly specified and could be larger, including in
the bytecode instructions!).
Some Java classes that assume that "char" arithmetic will automatically wrap after 16
bits are wrong. The JVM spec only requires that char be at least 16 bits wide (but it
may be larger). The compiled classes need to store string constants, but these
constants are serialized to be platform-independent using a UTF-8 encoding scheme.
Your sentence:
the "N" prefix fields *always* mean UCS-2 for MSSQLS, period.
is wrong, notably from the JDBC view, simply because we don't care about physical
types on the server, the networking protocol, or the middleware, but about JDBC
datatypes. Look at Sybase (from which MSSQLS inherited its client library, its SQL
language and syntax, and its list of supported datatypes): you'll see that Sybase
correctly handles ALL Unicode characters with NVARCHAR simply because it correctly
uses the UTF-16 encoding and checks it in all its conversion libraries.
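The kind of conversion-time check described here can be illustrated with Java's own charset machinery (a sketch of the principle, not Sybase's actual conversion code): a strict encoder, told to report rather than silently replace malformed input, refuses a string containing an unpaired surrogate.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncodeDemo {
    // Encodes s as UTF-8, reporting (rather than silently replacing)
    // any malformed UTF-16 input such as an unpaired surrogate.
    static boolean encodesCleanly(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            enc.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // invalid UTF-16 sequence detected at conversion time
        }
    }

    public static void main(String[] args) {
        System.out.println(encodesCleanly("\uD801\uDC00")); // true (valid pair)
        System.out.println(encodesCleanly("\uD801"));       // false (unpaired)
    }
}
```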
MSSQLServer will also perform these checks when converting a client encoding to
storage, because it internally uses the MultiByteToWideChar() Win32 API (and its
reverse, WideCharToMultiByte()) to perform this job. These APIs *will* report errors
that the SQL Server can use to detect invalid UTF-16 sequences, if the server is
tuned to do so, or if the client does not use UTF-16 but another encoding, or if the
client uses a UTF-8 encoding with sequences of more than 3 bytes on a server or
database that was not marked as being Unicode 3.0+ compatible. Look at the notices
for Chinese users, who most often use a codepage-9xx environment on Windows, and at
how Microsoft made its system and servers comply with the required support of the
GB18030 standard since 2000...
You should also look at the upgrade notices in MSSQLServer releases. Your assumptions
are wrong if you just consider the relaxed encoding rules permitted in previous
versions.
For migration purposes, there exist tools that will check whether an old database can
be marked to support a more recent version of Unicode, and that will help a dbadmin
diagnose the content of tables and SQL stored procs or triggers that need updates. If
all tests pass, the database will be marked with the new version and it will be
possible for clients to use a higher version of Unicode without experiencing
conversion problems or alerts...
From the JDBC point of view, these are driver-specific details. All that is needed in
JDBC is to map an RDBMS type to some known Java classes, and to be able to use
instances of these types when "discussing" with an RDBMS, handling possible
conversion errors by mapping them to Java exceptions.
So use the Java String class (which really is UTF-16 for now, but may be extended to
UTF-32 on future 64-bit platforms), and ignore the other encoding details, as this
job belongs to each JDBC driver (for now all of them build String instances using
UTF-16, but it will probably soon be possible to indicate to the JDBC driver a
preference for UTF-32 in order to remove the handling of surrogates in the Java
application).
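The point that the application sees only java.lang.String while each driver deals with the wire encoding can be sketched without any database: however the bytes travel (here UTF-8 vs. UTF-16BE, standing in for two hypothetical drivers), the decoded String the application receives is identical.

```java
import java.nio.charset.StandardCharsets;

public class DriverEncodingDemo {
    public static void main(String[] args) {
        // Text containing a BMP character and a supplementary one.
        String original = "caf\u00E9 " + new String(Character.toChars(0x10400));

        // Two hypothetical "drivers" shipping the same text with different
        // encoding schemes over the wire:
        byte[] wireUtf8    = original.getBytes(StandardCharsets.UTF_8);
        byte[] wireUtf16be = original.getBytes(StandardCharsets.UTF_16BE);

        // Each driver rebuilds a plain java.lang.String for the application:
        String fromUtf8    = new String(wireUtf8, StandardCharsets.UTF_8);
        String fromUtf16be = new String(wireUtf16be, StandardCharsets.UTF_16BE);

        System.out.println(fromUtf8.equals(original));    // true
        System.out.println(fromUtf16be.equals(original)); // true
    }
}
```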
Sun is discussing at length the best way to standardize a common API for the full
support of Unicode 3.2+, with a coherent set of methods for Strings, in a way that
would break the fewest applications (including those that made false assumptions
about the size of a native "char"; some existing String APIs will be officially
deprecated because they return a "char" instead of an int, or because they return
indexes in terms of code units and not in terms of code points).
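The char-vs-int distinction at issue here is visible in the API Java eventually shipped (in J2SE 5.0, after this message was written): charAt returns a 16-bit char, a single code unit, while codePointAt returns an int wide enough for any Unicode scalar value.

```java
public class CodeUnitVsCodePoint {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400)); // U+10400, outside the BMP

        // charAt indexes code units: it can only return half of the pair.
        char unit = s.charAt(0);
        System.out.println(Integer.toHexString(unit)); // d801 (high surrogate)

        // codePointAt returns the full scalar value as an int.
        int cp = s.codePointAt(0);
        System.out.println(Integer.toHexString(cp));   // 10400
    }
}
```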
The probable official full support of Unicode 4 and 3.2 will come with new classes
derived from Character and String (UChar and UString are their names in the IBM ICU
package, but Sun may also keep the class names and place them in the java.text
package instead of the core's java.lang package, and a compiler option (such as the
target Java version) may allow a class author to compile code against the default
java.lang.String or the java.text.String class if the package name is not specified
by an explicit import).
-- Philippe.
----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Philippe Verdy" <[EMAIL PROTECTED]>
Sent: Wednesday, June 04, 2003 4:36 PM
Subject: Re: Encoding converion through JDBC
> From: "Philippe Verdy" <[EMAIL PROTECTED]>
>
> Philippe, you went on for quite a while and I admit most of the things you
> talked about are not things about which I have knowledge. But some of the
> things you talked about, I do understand, and in those cases you were wrong.
> Psychologically, it causes me to wonder how much of the rest of this message
> conveys accurate information.
>
> Specifically, you talk about SQL Server but most of what you said about it
> is inaccurate. You cannot store big endian data without risking corruption,
> you can only store UCS-2, it is not surrogate aware and can thus be said to
> truly support only UCS-2, not UTF-16, and the "N" prefix fields *always*
> mean UCS-2 for MSSQLS, period.
>
> You have a gift -- that of being able to speak knowledgably. But please, use
> that gift for *good* and do not move past what you know.
>
> Please, think about it?
>
> MichKa
>