RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Carl W. Brown Wed, 30 May 2001 14:33:32 -0700

Simon,

Thanks for the information. I am very glad that Oracle will be supporting these character sets properly. I look forward to using 9i. Since Oracle will transform the Unicode from one encoding to another at the API layer, I don't see why users cannot retrieve the data in a single format. If they retrieve it as UTF-8, then it will have the natural Unicode sorting order. If they retrieve it as UTF-16, then they should process it as all other UTF-16 applications do and convert to UTF-32, at least on a character-by-character basis, for compares. Implementing special UTF-8 and UTF-32 systems that sort like UTF-16 is folly.
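To make the sorting point concrete, here is a minimal Python sketch. It is only an illustration of the general encoding behavior, not anything taken from Oracle's implementation, and the two sample characters are chosen arbitrarily: one BMP character above the surrogate range and one supplementary character.

```python
# Two sample characters (chosen for illustration only):
#   U+FF5E  a BMP character that lies above the surrogate range
#   U+10000 the first supplementary character
chars = ["\U00010000", "\uff5e"]

# Reference order: plain Unicode code point order.
by_code_point = sorted(chars, key=ord)

# Binary (byte-wise) order of each encoding form.
by_utf8 = sorted(chars, key=lambda c: c.encode("utf-8"))
by_utf16 = sorted(chars, key=lambda c: c.encode("utf-16-be"))
by_utf32 = sorted(chars, key=lambda c: c.encode("utf-32-be"))

print("UTF-8 matches code point order: ", by_utf8 == by_code_point)
print("UTF-32 matches code point order:", by_utf32 == by_code_point)
print("UTF-16 matches code point order:", by_utf16 == by_code_point)
```

UTF-16 is the odd one out because U+10000 is stored as the surrogate pair D800 DC00, which compares below FF5E byte for byte.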

Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Simon Law
Sent: Wednesday, May 30, 2001 11:02 AM
To: [EMAIL PROTECTED]
Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Hi Folks,
Over the last few days, this email thread has generated many interesting discussions on the UTF-8s proposal. At the same time, some speculation has arisen about why Oracle is asking for this encoding form. I hope to clarify some of this misinformation in this email.
In Oracle9i, our next Database Release shipping this summer, we have introduced support for two new Unicode character sets. One is 'AL16UTF16', which supports the UTF-16 encoding, and the other is 'AL32UTF8', which is the fully compliant UTF-8 character set. Both of these conform to the Unicode Standard, and surrogate characters are stored strictly in 4 bytes. For more information on Unicode support in Oracle9i, please check out the whitepaper "The power of Globalization Technology" at http://otn.oracle.com/products/oracle9i/content.html
The requests for UTF-8s came from many of our Packaged Applications customers (such as PeopleSoft, SAP, etc.); the ordering of the binary sort is an important requirement for these Oracle customers. We are supporting them, and we hope to turn this into a TR so that UTF-8s can be referenced by other vendors when they need compatible binary order for UTF-16 and UTF-8 across different platforms.
The speculation that we are pushing for UTF-8s because we are trying to minimize our code changes for supporting surrogates, or because of our unique database design, is totally false. Oracle has a fully internationalized, extensible architecture and has introduced surrogate support in Oracle9i. In fact, we are probably the first database vendor to support both the UTF-16 and UTF-8 encoding forms. We will continue to support them and to conform to future enhancements to the Unicode Standard.
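As a rough illustration of the difference being discussed, the following Python sketch contrasts standard UTF-8 with a UTF-8s-style encoding. It assumes, as the thread describes, that UTF-8s encodes each UTF-16 code unit (including each surrogate) as its own UTF-8 sequence; `utf8s_encode` is a hypothetical helper written for this example, not an Oracle or Unicode API.

```python
def utf8s_encode(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit separately as UTF-8.

    Assumption: this mirrors the UTF-8s proposal as described in this thread,
    where a surrogate pair becomes two 3-byte sequences.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U00010000"  # needs a surrogate pair in UTF-16
bmp_high = "\uff5e"           # BMP character above the surrogate range

standard = supplementary.encode("utf-8")  # one 4-byte sequence
variant = utf8s_encode(supplementary)     # two 3-byte sequences

print(len(standard), standard.hex())
print(len(variant), variant.hex())

# Binary order under the UTF-8s-style encoding follows UTF-16 binary order.
pair = [bmp_high, supplementary]
print(sorted(pair, key=utf8s_encode) ==
      sorted(pair, key=lambda c: c.encode("utf-16-be")))
```

For BMP characters the two encodings produce identical bytes; they differ only for supplementary characters, which is where both the 4-byte versus 6-byte storage question and the sort-order question arise.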
Regards

Simon
"Carl W. Brown" wrote:
Ken,
I suspect that Oracle is specifically pushing for this standard because of
its unique database design. In a sense Oracle almost picks itself up by
its own bootstraps. It has always tried to minimize actual code. Therefore
it was a natural choice to implement Unicode with UTF-8 because it is easy
to reuse the multibyte support with minor changes to handle a different
character length algorithm. This has been one of the reasons that Oracle
has been successful. Its Tinkertoy-like design has enabled them to quickly
adapt and add new features. Now, however, they should take the time to "do
it right". Its UTF-8 storage creates problems for database designers
because they cannot predict field sizes. This is a problem with MBCS code
pages but UTF-8s will make it worse. There will be lots of wasted storage
when characters can vary in size from 1 to 6 bytes.
Most other database systems require specific code to support Unicode. As a
consequence most have implemented it using UCS-2. Their migration path is
obviously to UTF-16. UTF-8s buys them nothing but headaches.
Carl
-----Original Message-----
From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 29, 2001 3:47 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and
email)
Carl,
> Ken,
>
> UTF-8s is essentially a way to ignore surrogate processing. It allows a
> company to encode UTF-16 with UCS-2 logic.
>
> The problem is that by not implementing surrogate support you can introduce
> subtle errors. For example it is common to break buffers apart into
> segments. These segments may be reconcatenated but they may be processed
> individually.
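The buffer-segmentation hazard described in the quoted text can be sketched in Python as follows. This is only an illustration under stated assumptions: the buffer holds big-endian UTF-16, cut offsets are aligned to 2-byte code units, and `safe_cut` is a hypothetical helper invented for this example.

```python
def safe_cut(buf: bytes, cut: int) -> int:
    """Hypothetical helper: adjust a byte offset so it does not split a
    surrogate pair. Assumes big-endian UTF-16 and a code-unit-aligned offset."""
    if 2 <= cut < len(buf):
        prev_unit = int.from_bytes(buf[cut - 2:cut], "big")
        # A high surrogate just before the cut means its low surrogate
        # would land in the next segment, so back up one code unit.
        if 0xD800 <= prev_unit <= 0xDBFF:
            return cut - 2
    return cut


text = "ab\U00010400cd"            # U+10400 is stored as the pair D801 DC00
units = text.encode("utf-16-be")   # 6 code units, 12 bytes

naive = 6                          # falls between D801 and DC00
adjusted = safe_cut(units, naive)

# With the adjusted offset, each segment is valid UTF-16 on its own.
head = units[:adjusted].decode("utf-16-be")
tail = units[adjusted:].decode("utf-16-be")
print(adjusted, repr(head), repr(tail))
```

Code written with UCS-2 logic would cut at the naive offset and leave half of a character in each segment, which is harmless only if the segments are always rejoined before anything interprets them.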
You are preaching to the choir here. I didn't state that *I* was in
favor of UTF-8S -- only that we have to be careful not to assume that
UTC will obviously not support it. The proponents of UTF-8S are
vigorously and actively campaigning for their proposal. In
standardization committees, proposals that have committed, active
proponents who can aim for the long haul, often have a way of getting
adopted in one form or another, unless there are equally committed
and active opponents of the proposal. It is just the nature of
consensus politicking in these committees, whether corporate based
or national body based.
Also, I consider the stated position of "near-universal agreement
among the database vendors" to be largely a rhetorical device by
the proponents. Oracle is clearly pushing the proposal. NCR has
stated it is not in favor of the proposal. The other big enterprise
database vendors are hedging their positions somewhat -- in
particular, the standards people in those companies may not be
entirely in agreement with some of their database engine developers, for
example. And the small database vendors are either not playing
in this space or are part of desktop systems that will just follow
the behavior of the platforms.
--Ken
