Hi Godmar,
On Tue, 29 Aug 2000, Godmar Back wrote:
> I was looking at this function in String.java:
>
> ----
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>         int len, ByteToCharConverter encoding) {
>     StringBuffer sbuf = new StringBuffer(len);
>     char[] out = new char[512];
>     int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>     while (outlen > 0) {
>         sbuf.append(out, 0, outlen);
>         outlen = encoding.flush(out, 0, out.length);
>     }
>     return sbuf;
> }
> ----
>
> Why can't this function be rewritten to read:
>
> ----
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>         int len, ByteToCharConverter encoding) {
>     char[] out = new char[len];
>     int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>     return new StringBuffer(outlen).append(out, 0, outlen);
> }
> ----
>
> Is it not fair to assume that converting n bytes will result in less than
> or equal to n characters?
For most of the encodings I've seen, that is a safe assumption.
Unfortunately, I haven't seen 'em all :)
I suspect that it's possible for a single byte to encode several
characters. Here is why:
Unicode supports "combining" characters, which modify other
characters. For example, you can add accents to ordinary letters.
Since Unicode is designed to allow easy conversion to and from
existing character sets, it also includes many precomposed
characters, like the German umlauts ä, ö, ü. You'd still need combining
characters to fully represent some scripts, like Thai. Markus
Kuhn says in his "UTF-8 and Unicode FAQ for Unix/Linux" [1]: "with
the Thai script, up to two combining characters are needed on a single
base character."
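
To make the precomposed vs. combining distinction concrete, here is a
small sketch. It assumes java.text.Normalizer, which only appeared in
JDK releases later than 1.3, and the class and method names are made up
for illustration:

```java
import java.text.Normalizer;

class CombiningDemo {
    // Precomposed form: U+00E4, LATIN SMALL LETTER A WITH DIAERESIS (one char).
    static final String PRECOMPOSED = "\u00E4";
    // Combining form: 'a' followed by U+0308, COMBINING DIAERESIS (two chars).
    static final String DECOMPOSED = "a\u0308";

    // Canonical decomposition (NFD) splits precomposed characters
    // into a base character plus its combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.length());                        // 1
        System.out.println(DECOMPOSED.length());                         // 2
        System.out.println(decompose(PRECOMPOSED).equals(DECOMPOSED));   // true
    }
}
```

Both strings display as the same letter, yet they differ in length, so
a converter that emits the combining form produces more chars than one
that emits the precomposed form.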
In his article on "Forms of Unicode" [2], Mark Davis debunks some of the
myths about characters vs. code points vs. code units. It features a
table with some unexpected entries. There is an encoding for the fi
ligature, for example [3]. The Unicode representation of some Arabic
characters depends on the context. Some characters require
several Unicode characters to be represented properly: "The Devanagari
syllable ksha is represented by three code points."
I haven't seen an encoding for Devanagari, so I don't know whether the
encoding for "ksha" would be less than three bytes. While doing research
for this post today, I did see other encodings, collected by Mark
Leisher as a supplement to the official Unicode conversion tables. And
some of them, like I3342, map a single byte to several characters
[4]. I don't think any of these encodings is supported by Sun's JDK 1.3,
though.
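
Here is a toy sketch of what such a mapping would do to the proposed
rewrite. Only the 0xA4 entry is taken from [4]; the rest of the table
(every other byte mapped Latin-1 style) and all names are invented for
illustration, so this is not a real I3342 decoder:

```java
class ExpandingDecodeDemo {
    // Hypothetical per-byte table lookup. The 0xA4 entry follows [4];
    // every other byte is simply widened to one char (an assumption).
    static char[] decodeByte(byte b) {
        int v = b & 0xFF;
        if (v == 0xA4) {
            return new char[] {'\u0631', '\u064A', '\u0627', '\u0644'};
        }
        return new char[] {(char) v};
    }

    // Decode into a growable buffer, so expansion cannot overflow.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(decodeByte(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] in = {(byte) 0xA4};
        // One input byte, four output chars: a char[in.length]
        // output buffer would be too small for this converter.
        System.out.println(in.length + " byte(s) -> "
                + decode(in).length() + " char(s)");
    }
}
```

With a table like this, a fixed output buffer of `len` chars cannot hold
the result, which is exactly the case the flush loop in the current
code guards against.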
To sum it up: I'm not convinced that the assumption holds. I guess
taking a look at the GNU libc iconv functionality should provide some
more insight, but I don't have the sources around right now. The GNU
libc folks have done a massive job supporting a variety of encodings,
so this might be another direction to look for advice.
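
As a point of comparison only: the java.nio.charset API, which arrived
in JDK releases later than 1.3, lets a decoder report its own worst-case
expansion via CharsetDecoder.maxCharsPerByte(). A minimal sketch of
sizing a buffer that way, with a made-up helper name, could look like
this:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

class BufferSizeDemo {
    // Upper bound on the number of chars the named charset's decoder
    // may produce for byteCount input bytes, as reported by the decoder.
    static int worstCaseChars(String charsetName, int byteCount) {
        CharsetDecoder dec = Charset.forName(charsetName).newDecoder();
        return (int) Math.ceil(byteCount * (double) dec.maxCharsPerByte());
    }

    public static void main(String[] args) {
        System.out.println(worstCaseChars("ISO-8859-1", 512));
        System.out.println(worstCaseChars("UTF-8", 512));
    }
}
```

This does not apply to the ByteToCharConverter class discussed above; it
only shows that asking the converter for its expansion factor is one
way to avoid hard-coding the "n bytes give at most n chars" assumption.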
Read ya,
Dali
[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[2] ftp://www6.software.ibm.com/software/developer/library/utfencodingforms.pdf
[3] \uFB01 according to Unicode-Data-3.0.txt
[4] 0xA4 -> 0x0631 0x064A 0x0627 0x0644 for PERSIAN RIAL SIGN