Stepan Mishura wrote:
On 3/27/06, Richard Liang wrote:
Nathan Beyer wrote:
I've seen similar differences between other VMs in the handling of
UTF-8 encoded data, especially between the Sun and IBM VMs. For example,
if you read a file as UTF-8 and it contains one or more invalid bytes,
the IBM VM will throw an IOException, but the Sun VM will convert the
invalid bytes into the Unicode replacement character, U+FFFD (the
diamond-backed question mark). Personally, I prefer VMs that stick
explicitly to Unicode and its various encodings and that indicate error
conditions.
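The two behaviours described above can be reproduced on a single VM with the java.nio.charset API. The following is only a sketch, not what either VM does internally: the class and method names are made up for illustration, and it assumes a Java version that has StandardCharsets (Java 7 or later). CodingErrorAction.REPLACE gives the lenient behaviour, CodingErrorAction.REPORT the strict one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
public class Utf8ErrorPolicy {

    // Lenient decoding: malformed input is replaced with U+FFFD.
    public static String decodeReplacing(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    // Strict decoding: malformed input is reported as an error.
    public static boolean isWellFormed(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        System.out.println(isWellFormed(bad));
        System.out.println(decodeReplacing(bad).indexOf('\uFFFD') >= 0);
    }
}
```

With the strict decoder the caller decides what to do with ill-formed data. With the lenient one the error is silently folded into the resulting text.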
Hello Nathan,
+1, we shall stick to Unicode and various encodings.
For me it is not obvious, and I cannot make the choice.
Let's consider a hypothetical situation: suppose the next Unicode spec.
update or corrigendum requires a change that breaks Harmony's backward
compatibility. Should we stick to the new Unicode version or stay
backward compatible?
Hello Stepan,
For this situation, we may have three options:
1. Comply with the new version of the Unicode spec
2. Comply with the original version of the Unicode spec
3. Comply with the new version of the Unicode spec while keeping some
violations
I think 1 and 2 may be proper answers, but 3 is not.
Let's think about why we support Unicode. IMHO, it's because Unicode is
a bridge that ensures interoperability between applications that use
different encoding systems. If we announce that we support one version
of Unicode while keeping some violations, how can we ensure
interoperability with other applications?
Thanks,
Stepan.
-Nathan
-----Original Message-----
From: Stepan Mishura [mailto:[EMAIL PROTECTED]
Sent: Friday, March 24, 2006 12:57 AM
To: harmony-dev
Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms
According to the Unicode Standard 4.0 (since 3.0), interpretation of
non-shortest forms is forbidden for UTF-8. So if a byte sequence is not
in the table of well-formed UTF-8 byte sequences, it is considered
ill-formed and treated as an error. Harmony follows the Unicode spec.
but the RI doesn't. I didn't find an explanation in the spec., but I
assume it is caused by backward compatibility.
The following example demonstrates the difference. Code point 1071
(U+042F) should be represented by the UTF-8 byte sequence <D0 AF>. But
it may also be represented as the 3-byte sequence <E0 90 AF>, which is
its non-shortest form. So the following code prints "ERROR" on the
Harmony implementation and "Ok with non-shortest forms" on the RI:
    // <E0 90 AF> is the non-shortest (3-byte) form of U+042F
    String s1 = new String(
            new byte[]{(byte) 0xE0, (byte) 0x90, (byte) 0xAF}, "UTF-8");
    // 1071 == 0x042F, the same code point given directly
    String s2 = new String(new char[]{1071});
    if (s1.equals(s2)) {
        System.out.println("Ok with non-shortest forms");
    } else {
        System.out.println("ERROR");
    }
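To see why <E0 90 AF> and <D0 AF> both carry the value 1071, here is a small sketch of the bit arithmetic. The class and method names are hypothetical, and this is not the decoder of Harmony or the RI. It only extracts the payload bits of a 3-byte pattern and checks the shortest-form rule for that length (it ignores the surrogate range and all other validity checks).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, named for illustration only.
public class NonShortestForm {

    // Combine the payload bits of the pattern 1110xxxx 10xxxxxx 10xxxxxx.
    // No validity checks are made: this is what a lenient decoder computes.
    public static int rawThreeByteValue(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    // A 3-byte sequence is the shortest form only for code points
    // U+0800 and above; smaller values fit in one or two bytes.
    public static boolean isShortestThreeByteForm(int codePoint) {
        return codePoint >= 0x0800;
    }

    public static void main(String[] args) {
        int cp = rawThreeByteValue(0xE0, 0x90, 0xAF);
        System.out.println(cp);                          // prints 1071
        System.out.println(isShortestThreeByteForm(cp)); // prints false

        // The shortest form, as produced by the platform encoder.
        byte[] shortest =
                new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(shortest));   // prints [-48, -81], i.e. D0 AF
    }
}
```

A strict decoder rejects the sequence because the value 0x042F is below 0x0800. A lenient one returns the value without that check.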
We should decide whether we are going to be compatible with the RI or
with the Unicode spec.
Thanks,
Stepan Mishura
Intel Middleware Products Division
--
Richard Liang
China Software Development Lab, IBM