Re: [bug-to-bug] UTF-8: interpreting non-shortest forms

Richard Liang Mon, 27 Mar 2006 03:59:14 -0800

Stepan Mishura wrote:

On 3/27/06, Richard Liang wrote:

Nathan Beyer wrote:

I've seen similar differences between other VMs around the handling of
UTF-8 encoded data, especially between Sun and IBM VMs. For example, if
you read a file with a UTF-8 encoding that contains invalid bytes, the
IBM VM will throw an IOException, but the Sun VM will convert the
invalid bytes into the Unicode replacement character (U+FFFD, the
diamond-backed question mark).
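
A minimal sketch of the two policies described above, written against
the standard java.nio.charset API rather than either VM's internals.
The class and method names (Utf8ErrorPolicy, decodeLenient,
decodesStrictly) are made up for illustration, and the sketch does not
claim to reproduce what the Sun or IBM VM actually does when reading a
file. It only shows that "report the error" and "substitute U+FFFD" are
both selectable behaviours of a CharsetDecoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8ErrorPolicy {
    // Builds a fresh UTF-8 decoder that applies the same action to
    // malformed and unmappable input.
    private static CharsetDecoder decoder(CodingErrorAction action) {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
    }

    // Lenient policy: malformed input is replaced with U+FFFD.
    public static String decodeLenient(byte[] bytes) {
        try {
            return decoder(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when the action is REPLACE.
            throw new IllegalStateException(e);
        }
    }

    // Strict policy: returns false if the decoder reports an error.
    // CharacterCodingException is a subclass of IOException.
    public static boolean decodesStrictly(byte[] bytes) {
        try {
            decoder(CodingErrorAction.REPORT).decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in UTF-8
        System.out.println("strict accepts input: " + decodesStrictly(bad));
        System.out.println("lenient has U+FFFD:   "
                + (decodeLenient(bad).indexOf('\uFFFD') >= 0));
    }
}
```

With the strict policy the caller has to handle the failure explicitly;
with the lenient one the damage is silently carried into the resulting
string.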


Personally, I prefer VMs that explicitly stick to Unicode and the
various encodings and indicate error conditions.

Hello Nathan,

+1, we shall stick to Unicode and various encodings.




For me it is not obvious, and I cannot make the choice.
Let's consider a hypothetical situation: suppose the next Unicode spec
update or corrigendum requires a change that breaks Harmony's backward
compatibility. Should we stick to the new Unicode version or stay
backward compatible?

Hello Stepan,

For this situation, we may have three options:
1. Comply with the new version of the Unicode spec.
2. Comply with the original version of the Unicode spec.
3. Comply with the new version of the Unicode spec while keeping some
   violations.

I think 1 and 2 may be proper answers, but 3 is not.

Let's think about why we support Unicode. IMHO, it's because Unicode is
a bridge that ensures interoperability between applications that use
different encoding systems. If we announce that we support one version
of Unicode and at the same time keep some violations, how can we ensure
interoperability with other applications?

Thanks,
Stepan.

-Nathan

-----Original Message-----
From: Stepan Mishura [mailto:[EMAIL PROTECTED]
Sent: Friday, March 24, 2006 12:57 AM
To: harmony-dev
Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms

According to the Unicode Standard 4.0 (since 3.0), interpretation of
non-shortest forms is forbidden for UTF-8. So if a byte sequence is not
in the table of well-formed UTF-8 byte sequences, it is considered
ill-formed and treated as an error. Harmony follows the Unicode spec,
but the RI doesn't. I didn't find an explanation in the spec, but I
assume it is caused by backward compatibility.
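
For illustration, the table lookup mentioned above can be sketched as a
small checker. The byte ranges are the rows of the Unicode Standard's
table of well-formed UTF-8 byte sequences; the class and method names
(Utf8WellFormed, isWellFormed, inRange) are hypothetical, and this is
not Harmony's actual decoder.

```java
public class Utf8WellFormed {
    // True if index i exists and the unsigned byte there lies in [lo, hi].
    private static boolean inRange(byte[] b, int i, int lo, int hi) {
        if (i >= b.length) {
            return false; // truncated sequence
        }
        int v = b[i] & 0xFF;
        return v >= lo && v <= hi;
    }

    // Checks the whole array against the well-formed byte ranges.
    public static boolean isWellFormed(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int lead = b[i] & 0xFF;
            int lo = 0x80, hi = 0xBF; // allowed range of the second byte
            int trail;                // number of bytes after the lead byte
            if (lead <= 0x7F) {
                trail = 0;
            } else if (lead >= 0xC2 && lead <= 0xDF) {
                trail = 1;            // C0 and C1 would be overlong
            } else if (lead == 0xE0) {
                trail = 2; lo = 0xA0; // excludes overlong three-byte forms
            } else if (lead == 0xED) {
                trail = 2; hi = 0x9F; // excludes surrogates
            } else if (lead >= 0xE1 && lead <= 0xEF) {
                trail = 2;
            } else if (lead == 0xF0) {
                trail = 3; lo = 0x90; // excludes overlong four-byte forms
            } else if (lead >= 0xF1 && lead <= 0xF3) {
                trail = 3;
            } else if (lead == 0xF4) {
                trail = 3; hi = 0x8F; // nothing above U+10FFFF
            } else {
                return false;         // 80..C1 and F5..FF cannot lead
            }
            if (trail >= 1 && !inRange(b, i + 1, lo, hi)) {
                return false;
            }
            for (int k = 2; k <= trail; k++) {
                if (!inRange(b, i + k, 0x80, 0xBF)) {
                    return false;
                }
            }
            i += trail + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] shortest = {(byte) 0xD0, (byte) 0xAF};
        byte[] overlong = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};
        System.out.println(isWellFormed(shortest)); // true
        System.out.println(isWellFormed(overlong)); // false
    }
}
```

The restricted second-byte range after a lead byte of E0 is what rules
out three-byte encodings of code points that fit in two bytes.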

The following example demonstrates the difference. Code point 1071
(U+042F) should be represented by the UTF-8 byte sequence <D0 AF>, but
it may also be written as the three-byte sequence <E0 90 AF>, which is
its non-shortest form. So the following code prints "ERROR" on the
Harmony implementation and "Ok with non-shortest forms" on the RI:

        // <E0 90 AF> is the non-shortest form of code point 1071
        String s1 = new String(
                new byte[]{(byte) 0xE0, (byte) 0x90, (byte) 0xAF}, "UTF-8");
        String s2 = new String(new char[]{1071});

        if(s1.equals(s2)){
            System.out.println("Ok with non-shortest forms");
        } else {
            System.out.println("ERROR");
        }
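
To see where the two byte sequences come from, here is a small sketch
of the bit arithmetic. All names (OverlongDemo and its methods) are
hypothetical. In particular, decodeNaively deliberately omits the
minimum-value check; that is an assumption about how a lenient decoder
could end up accepting the longer form, not a description of the RI's
code.

```java
public class OverlongDemo {
    // Two-byte pattern 110xxxxx 10xxxxxx, the shortest form for
    // code points U+0080..U+07FF.
    public static byte[] encodeTwoBytes(int cp) {
        return new byte[] {
            (byte) (0xC0 | (cp >> 6)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. For cp < 0x800 the
    // top bits are zero padding, so the result is a non-shortest form.
    public static byte[] encodeThreeBytes(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Strips the marker bits of a two- or three-byte sequence without
    // checking that the shortest form was used.
    public static int decodeNaively(byte[] b) {
        if (b.length == 2) {
            return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F);
        }
        return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
    }

    // The check a conforming decoder adds: the decoded value must need
    // this many bytes.
    public static boolean isOverlong(byte[] b) {
        int min = (b.length == 2) ? 0x80 : 0x800;
        return decodeNaively(b) < min;
    }

    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] two = encodeTwoBytes(1071);
        byte[] three = encodeThreeBytes(1071);
        System.out.println(hex(two) + " -> " + decodeNaively(two));
        System.out.println(hex(three) + " -> " + decodeNaively(three));
        System.out.println("three-byte form overlong: " + isOverlong(three));
    }
}
```

Both sequences decode to 1071 when the check is skipped, which is why
the two implementations disagree on the comparison above.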

We should decide whether we are going to be compatible with the RI or
with the Unicode spec.

Thanks,
Stepan Mishura
Intel Middleware Products Division



--
Richard Liang

China Software Development Lab, IBM
