Hello, Dmitry, I agree with you that Harmony's behavior is not consistent with the Java spec.
As you may know, java.nio.charset.Charset wraps ICU to implement encode/decode operations. The following description is cited from the ICU User Guide (http://icu.sourceforge.net/userguide/unicodeBasics.html):

"The names 'UTF-16' and 'UTF-32' are ambiguous. Depending on context, they refer either to character encoding forms, where 16/32-bit words are processed and are naturally stored in the platform endianness, or they refer to the IANA-registered charset names, i.e., to character encoding schemes or byte serializations. In addition to simple byte serialization, the charsets with these names also use optional Byte Order Marks (see 'Serialized Formats', http://icu.sourceforge.net/userguide/unicodeBasics.html#serialized_formats)."

The result of running your test case on IBM JDK 1.4.2 is exactly the same as on Harmony, and I assume IBM JDK 1.4.2 has passed the TCK. Therefore, IMO, both behaviours are acceptable. What's your opinion?

On 4/7/06, Dmitry M. Kononov <[EMAIL PROTECTED]> wrote:
>
> Hi Richard,
>
> On 4/6/06, Richard Liang <[EMAIL PROTECTED]> wrote:
> > Dmitry M. Kononov wrote:
> > > As you rightly noticed, the cause of this issue is that Harmony uses
> > > the little-endian byte order if an encoded UTF-16 sequence has no
> > > byte-order mark. However, the spec covers such a case explicitly:
> > >
> > > "When decoding, the UTF-16 charset interprets a byte-order mark to
> > > indicate the byte order of the stream but defaults to big-endian if
> > > there is no byte-order mark; when encoding, it uses big-endian byte
> > > order and writes a big-endian byte-order mark."
> >
> > Hello Dmitry,
> >
> > Yes, although Harmony and the RI use different byte orders, as both
> > Harmony and the RI use a byte-order mark (U+FEFF), I think both Harmony
> > and the RI are compliant with the specification. So could we regard
> > HARMONY-308 as "not a bug"?
>
> I think Harmony's behavior in this case is inconsistent with the Java
> spec, since the spec defines the expected behavior explicitly:
> "when encoding, it uses big-endian byte order and writes a big-endian
> byte-order mark." But Harmony's encode() returns bytes in little-endian
> order.
>
> It seems I do not understand why you think Harmony follows the spec
> correctly in this case. :) I am really sorry for my misunderstanding.
>
> From the test case attached to HARMONY-308:
>
> 1) We have a char array that has no byte-order mark:
>
> private static final char chars[] = {
>     0x041b, 0x0435, 0x0442, 0x043e, 0x0020, 0x0432, 0x0020, 0x0420,
>     0x043e, 0x0441, 0x0441, 0x0438, 0x0438};
>
> 2) We have the byte array that we expect encode() to return:
>
> private static final byte bytes[] = {
>     (byte) 254, (byte) 255, (byte)  4, (byte) 27, (byte)  4, (byte) 53,
>     (byte)   4, (byte)  66, (byte)  4, (byte) 62, (byte)  0, (byte) 32,
>     (byte)   4, (byte)  50, (byte)  0, (byte) 32, (byte)  4, (byte) 32,
>     (byte)   4, (byte)  62, (byte)  4, (byte) 65, (byte)  4, (byte) 65,
>     (byte)   4, (byte)  56, (byte)  4, (byte) 56};
>
> Please note, according to the spec we expect the bytes returned by
> encode() to be in big-endian byte order. So, we expect the FEFF
> byte-order mark. Do you agree this expectation is correct and consistent
> with the spec?
>
> Thanks.
> --
> Dmitry M. Kononov
> Intel Managed Runtime Division

--
Andrew Zhang
China Software Development Lab, IBM
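[Editor's note: the expectation Dmitry describes can be reproduced with a small standalone sketch. The class name is hypothetical and this is not Harmony's actual test case; it uses only java.nio and assumes a JDK that follows the spec quoted above, i.e. one that writes a big-endian BOM. The sample character U+041B is the first element of the char array in the test case.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Utf16EncodeCheck {
    public static void main(String[] args) {
        // Encode U+041B with the ambiguous "UTF-16" charset name.
        ByteBuffer bb = Charset.forName("UTF-16").encode("\u041b");
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        for (byte b : out) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
        // Per the spec, a compliant JVM prints "FE FF 04 1B": a big-endian
        // BOM (FE FF) followed by U+041B serialized big-endian (04 1B).
        // Harmony (and, per this thread, IBM JDK 1.4.2) reportedly produced
        // the little-endian serialization instead.
    }
}
```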

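[Editor's note: the decoding half of the spec sentence quoted in the thread ("defaults to big-endian if there is no byte-order mark") can be sketched the same way. The class name is hypothetical; only standard java.nio is used, and a spec-compliant decoder is assumed.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Utf16DecodeCheck {
    public static void main(String[] args) {
        // U+041B serialized big-endian, with no byte-order mark.
        byte[] noBom = { 0x04, 0x1B };
        String s = Charset.forName("UTF-16")
                .decode(ByteBuffer.wrap(noBom)).toString();
        // With no BOM present, a spec-compliant decoder assumes big-endian,
        // so the single decoded character is U+041B.
        System.out.printf("U+%04X%n", (int) s.charAt(0));
    }
}
```

Note that the explicitly named charsets "UTF-16BE" and "UTF-16LE" avoid the ambiguity entirely: they fix the byte order and neither write nor require a BOM.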