In all major databases, the native encodings of the OS, of the database when it was 
created, of the networking protocol, of the SQL queries and results, and of the client 
application are all independent.

When JDBC connects to a database, it gets a lot of environment information from the 
server (true for Oracle, Sybase, MSSQL, Informix, DB2), and the client library manages 
the conversion to the best encoding supported by both the client and the server in 
order to minimize transcoding fallbacks. As such, the networking protocol does not 
always use a Unicode encoding scheme, but may use the same encoding for the client 
and server, if both support this encoding.
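
For illustration, here is a minimal sketch of how a JDBC application could inspect 
the character sets in play, assuming an Oracle server (the connection URL and 
credentials are placeholders only; other RDBMS expose this information through their 
own system views):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ShowDatabaseCharset {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and credentials, for illustration only.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
                 Statement st = con.createStatement();
                 // Oracle records the database and national character sets
                 // in the NLS_DATABASE_PARAMETERS catalog view.
                 ResultSet rs = st.executeQuery(
                     "SELECT parameter, value FROM NLS_DATABASE_PARAMETERS"
                     + " WHERE parameter LIKE '%CHARACTERSET'")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getString(2));
                }
            }
        }
    }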

Encoding negotiation is part of the RDBMS-specific networking protocol (SQL*Net or 
SQL*Net V2 for Oracle, TDS for Sybase and MSSQL, IFX-Net for Informix, ... I'm not sure 
about other RDBMS engines and protocols). These protocols existed long before Unicode 
and ISO 10646 were standardized, at a time when the management of multiple encodings 
was already really needed. Unicode encoding forms are just supplementary options for 
this negotiation.

Generally the solution adopted will be to perform the needed conversion transparently, 
but it supposes some capability of the server's tables to store these characters, and 
some support in the client to limit fallbacks (depending on the RDBMS, such fallbacks 
may be performed without any explicit indication to the client application, or they 
can produce SQL errors, or the application can specify the behavior using triggers, 
either in the client library or through administrative SQL commands for the session).
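
As a minimal sketch of the "SQL error" case, assuming a table t with a column name 
VARCHAR(40) whose encoding cannot represent the inserted character (the table and 
column names are hypothetical):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class FallbackDemo {
        static void insertName(Connection con) {
            // U+0152 (OE ligature) is missing from many 8-bit charsets.
            String s = "\u0152uvre";
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO t (name) VALUES (?)")) {
                ps.setString(1, s);
                ps.executeUpdate();
            } catch (SQLException e) {
                // Some servers reject the statement (invalidating the
                // transaction); others silently store a replacement such
                // as '?' and never reach this handler.
                System.err.println("Conversion failure: " + e.getMessage());
            }
        }
    }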

It's not JDBC itself that performs these adaptations: most of the work is performed 
directly in the native client library of the specific RDBMS engine (preferably, as it 
scales better and offers more control to the application), with some conversions 
sometimes accepted by the server (but most servers will just raise an exception or 
SQL error that will invalidate the transaction if a network encoding is not convertible 
to the encoding of the storage tables).

Note also: you use the term "UCS-2" inaccurately. NVARCHARs store code units, 
not code points. UCS-2 designates only the subrange of valid characters in the 
BMP. In fact, you should have said that NVARCHARs are just defined to allow storing 
international characters. Depending on the way the database was created, this may 
store strings as UTF-8 or UTF-16.
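
A small Java example shows the same distinction, since Java Strings are also 
sequences of UTF-16 code units: a character outside the BMP takes two code units 
(a surrogate pair), so the code-unit length and the code-point count differ:

    public class CodeUnitsVsCodePoints {
        public static void main(String[] args) {
            // "G" followed by U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
            String s = "G\uD834\uDD1E";
            System.out.println(s.length());                       // 3 code units
            System.out.println(s.codePointCount(0, s.length()));  // 2 code points
        }
    }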

If the database is portable and architecture-independent, the native byte order of the 
running platform may not be decisive, but instead will be specified by a global flag 
stored in the table descriptor or in the master catalog of the database (so that a 
database can be closed and its storage remounted on another server without requiring a 
lengthy import).

For Oracle and Sybase or MSSQL, NVARCHARs are database types, but they may be 
constrained or tuned during database creation, so that code units will be 8-bit or 
16-bit. This however does not change the length constraint, which is always considered 
in terms of logical characters (and checked only by run-time triggers) and not in 
terms of absolute byte length (so NVARCHARs and even NCHARs are stored as varying-
length byte sequences).
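
A minimal sketch of why a constraint in logical characters cannot be a fixed byte 
length: the same 5-character string needs a different number of bytes depending on 
the code-unit size chosen at database creation:

    import java.nio.charset.StandardCharsets;

    public class CharVsByteLength {
        public static void main(String[] args) {
            String s = "caf\u00E9s";        // 5 logical characters
            System.out.println(s.length()); // 5 UTF-16 code units
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 6 bytes
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 10 bytes
        }
    }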

So N(VAR)CHARs and (VAR)CHARs only differ logically by the fact that the set of 
allowed characters for N(VAR)CHARs must *include* the set of allowed characters for 
(VAR)CHARs (and thus *may* be larger). Both sets may or may not be equal and each one 
may have a distinct encoding in terms of code units.

There is no guarantee that NVARCHAR will always accept all Unicode characters (this 
can be restricted by database creation constraints or tuning), and a database may be 
created so that it accepts the whole Unicode character set in CHARs or VARCHARs (in 
which case NCHARs or NVARCHARs will most often be equivalent in terms of processing 
and storage).

The main interest of N(VAR)CHARs (generally implemented to support all Unicode 
characters) is that they allow fine tuning of the storage requirements by separating 
strings that are not designed to be translated (such as referential keys) from those 
that will store localizable data (including secondary search keys).

The length constraint on N(VAR)CHAR just adds processing time, because such a 
datatype almost always requires varying-length storage (this is why it is often not 
adequate for referential keys). But it is still better than using text LOBs, for 
performance reasons (and also for storage requirement reasons, as most LOBs will 
allocate space with a large granularity of one database page per text/string, most 
often set to 2KB or 4KB, or because modifications of LOBs may be excluded from 
"rollbackable" transactions).

In a pure logical design, all these variants of CHAR, VARCHAR, NCHAR, NVARCHAR, and 
TEXT LOBs are equivalent. They all designate text, but these multiple RDBMS datatypes 
just add opportunities for specific tuning according to some constraints, or just add 
a distinction in the handling of the implicit space padding (which you would logically 
expect when exporting to a fixed-width file or when comparing keys, but which may or 
may not be physically stored, depending on RDBMS-specific tuning).
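
As a minimal sketch of that padding distinction, assuming a CHAR(10) column named 
"code" that stores 'ABC' (the names are hypothetical): most drivers return the value 
padded with spaces to the declared length, which a VARCHAR would not carry, so the 
application must strip the padding before comparing keys:

    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PaddingDemo {
        static boolean matchesCode(ResultSet rs, String expected) throws SQLException {
            String code = rs.getString("code");  // e.g. "ABC       " (10 chars)
            // Strip the implicit CHAR padding before comparing as a key.
            return code != null && code.trim().equals(expected);
        }
    }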

For all these reasons, such tuning of encoding depends mostly on the capabilities of 
the JDBC driver. In Java, you will always use Strings internally handled as UTF-16 
code units with most APIs, except in the serialization of class files, which use a 
modified UTF-8 for String constants before they are loaded and initialized at 
run-time. JDBC itself does not make this conversion.
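
You can observe that serialized form directly: DataOutputStream.writeUTF() uses the 
same modified UTF-8 as class-file String constants (U+0000 is encoded as the two 
bytes C0 80 instead of a single null byte):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeUTF("A\u0000");    // 2-byte length prefix, then 41 C0 80
            for (byte b : bos.toByteArray()) {
                System.out.printf("%02X ", b);  // prints: 00 03 41 C0 80
            }
        }
    }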

Converting between these Java strings and the database's encoding is the work of the 
JDBC driver, which can use conversion tools and tables from the RDBMS server, from 
its native client library, or from the Java toolkit. For performance reasons, each 
driver can choose how it performs this conversion. So normally you don't have to 
worry about that in transactional requests, except if you create a database through 
JDBC.

(This should rarely be the case, as there are too many RDBMS-specific features to tune 
according to an application's schema, storage architecture and security. If you do 
this through JDBC, that Java code will use RDBMS-specific requests that will not be 
portable to other architectures, and should be reserved to building an application for 
DB admins. The choice of the RDBMS will be critical for such an application if you 
depend a lot on it, as it will limit your deployment possibilities and future 
scalability on other shared platforms, or shared application design tools and teams.)
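
Coming back to the driver's conversion choice: some drivers expose it as a connection 
property. As a sketch, assuming Sybase jConnect, which understands a CHARSET property 
(the property name is driver-specific; other drivers use other names or none at all, 
letting the native client library negotiate on its own):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class DriverCharsetHint {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("user", "sa");          // placeholder credentials
            props.setProperty("password", "secret");
            props.setProperty("CHARSET", "utf8");     // jConnect-specific hint
            try (Connection con = DriverManager.getConnection(
                     "jdbc:sybase:Tds:dbhost:5000", props)) {
                System.out.println("Connected; encoding negotiated by the driver");
            }
        }
    }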

The common JDBC reference will not show you the exact details. In fact, the details of 
the encodings supported and preferably selected for your JDBC application are specific 
to each RDBMS. But I really think that the encoding of strings is an internal 
consideration that JDBC applications should not be exposed to (so I would not like to 
use an encoding specifier in a URL, as it is not the role of a URL to specify it, 
but instead the role of the client middleware to negotiate the best encoding supported 
by both the client application and the remote server, so that the same URL can be used 
across platforms).

Because of this consideration, I think that this discussion is not relevant in this 
newsgroup. This problem is not related to Unicode but to the tuning and portability of 
JDBC middleware and applications. It would be better to continue this discussion in 
the online Sun forums or lists for Java related to JDBC.

(Unicode is just a minor aspect of, and a non-mandatory response to, your question; 
your real problem is not Unicode, but the choice of the best JDBC drivers or RDBMS 
engines for your application needs or its deployment. As part of your question you 
should include the problem of the administration of your database, and the cost and 
ease with which you'll be able to guarantee its security and performance, and how 
you'll scale it. This goes too far for what can be discussed here, so we won't speak 
further about JDBC and its relevance as a solution for your development and 
deployment.)

-- Philippe.
----- Original Message ----- 
From: "souravm" <[EMAIL PROTECTED]>

Hi All,
 
I've some query on how JDBC is supposed to convert encoding of
characters while updating/selecting multibyte strings to/from a Database
supporting Unicode characters. The specific queries are like ...

