Hi Godmar,

On Tue, 29 Aug 2000, Godmar Back wrote:
> I was looking at this function in String.java:
> 
> ----
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>                 int len, ByteToCharConverter encoding) {
>         StringBuffer sbuf = new StringBuffer(len);
>         char[] out = new char[512];
>         int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>         while (outlen > 0) {
>                 sbuf.append(out, 0, outlen);
>                 outlen = encoding.flush(out, 0, out.length);
>         }
>         return sbuf;
> }
> ----
> 
> Why can't this function be rewritten to read:
> 
> ----
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>                 int len, ByteToCharConverter encoding) {
>         char[] out = new char[len];
>         int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>       return new StringBuffer(outlen).append(out, 0, outlen);
> }
> ----
> 
> Is it not fair to assume that converting n bytes will result in less than
> or equal to n characters?

For most of the encodings I've seen, that is a safe assumption.
Unfortunately, I haven't seen 'em all :)

I suspect that a single byte can encode several characters, and here is
why: Unicode supports "combining" characters, which modify other
characters. For example, you can add accents to ordinary characters.
Since Unicode is designed to allow easy conversion to/from existing
character sets, it also includes many precomposed characters, like the
German umlauts ä, ö, ü. You'd still need combining characters to fully
represent some scripts, like Thai. Markus Kuhn says in his "UTF-8 and
Unicode FAQ for Unix/Linux" [1]: "with the Thai script, up to two
combining characters are needed on a single base character."
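
To make the precomposed vs. combining distinction concrete, here is a
small sketch. It assumes java.text.Normalizer, which only exists in JDK
versions later than the ones discussed in this thread, so take it as an
illustration rather than something you could drop into the class
library as it stands.

```java
import java.text.Normalizer;

public class CombiningDemo {
    // Decompose a string into base characters plus combining marks (NFD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4";               // a with diaeresis, one char
        String decomposed = decompose(precomposed);  // 'a' followed by U+0308
        System.out.println(precomposed.length());    // 1
        System.out.println(decomposed.length());     // 2
    }
}
```

The same visible character therefore takes one or two Java chars,
depending on which form the converter emits.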

In his article on "Forms of Unicode" [2], Mark Davis debunks some of the
myths about characters vs. code points vs. code units. It features a
table with some unexpected entries. There is a code point for the fi
ligature, for example [3]. The Unicode representation of some Arabic
characters depends on the context. Some characters require several
Unicode code points to be represented properly: "The Devanagari
syllable ksha is represented by three code points."
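
A quick way to see both cases from that table in code. The ligature
code point is the one from [3]. The ksha sequence (U+0915, U+094D,
U+0937) is my own reading of how the syllable is spelled, not something
taken from the article, and codePointCount is a later JDK addition.

```java
public class CodePointDemo {
    // KA + VIRAMA + SSA: one syllable to the reader, three code points.
    static final String KSHA = "\u0915\u094D\u0937";
    // LATIN SMALL LIGATURE FI: two letters to the reader, one code point.
    static final String FI = "\uFB01";

    // Count Unicode code points rather than Java chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codePoints(KSHA)); // 3
        System.out.println(codePoints(FI));   // 1
    }
}
```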

I haven't seen a byte encoding for Devanagari, so I don't know whether
"ksha" would take fewer than three bytes. But while researching this
post today, I looked at other encodings collected by Mark Leisher as a
supplement to the official Unicode conversion tables. Some of them,
like I3342, map a single byte to several characters [4]. I don't think
any of these encodings is supported by Sun's JDK 1.3, though.
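
Here is a rough sketch of what such a mapping does to the "n bytes give
at most n characters" assumption. The decoder below is hypothetical: it
only knows the 0xA4 mapping from [4] and treats every other byte as
Latin-1, which is an assumption made purely for the example. It does
not use the ByteToCharConverter interface from the quoted code.

```java
public class ExpandingDecoderDemo {
    // Mapping from footnote [4]: byte 0xA4 becomes four characters.
    static final char[] RIAL = { '\u0631', '\u064A', '\u0627', '\u0644' };

    // Hypothetical decoder: 0xA4 expands to RIAL, every other byte is
    // treated as Latin-1 (an assumption made only for this sketch).
    static String decode(byte[] bytes, int offset, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = offset; i < offset + len; i++) {
            int b = bytes[i] & 0xFF;
            if (b == 0xA4) {
                sb.append(RIAL);
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] in = { (byte) 0xA4, (byte) 0xA4 };
        String out = decode(in, 0, in.length);
        // Prints "2 bytes -> 8 chars"
        System.out.println(in.length + " bytes -> " + out.length() + " chars");
    }
}
```

With an encoding like this, an output array of only len chars would be
too small, which is why a growable buffer is the safer choice.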

To sum it up: I'm not convinced the assumption holds. Taking a look at
the iconv functionality in GNU libc should provide some more insight,
but I don't have the sources around right now. The GNU libc folks have
done a massive job supporting a variety of encodings, so that might be
another place to look for advice.

Read ya,

Dali

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[2] ftp://www6.software.ibm.com/software/developer/library/utfencodingforms.pdf
[3] \uFB01 according to Unicode-Data-3.0.txt
[4] 0xA4         -> 0x0631 0x064A 0x0627 0x0644  for PERSIAN RIAL SIGN

