On Thu, Aug 29, 2013 at 12:28 AM, Bennie Kloosteman <[email protected]> wrote:

> ...And Unicode is not even official here; officially you should use ASCII
> and then an encoding scheme, GB or GBK (Unicode can't do newer chars, so
> they do this encoding on Unicode anyway and suffer a double whammy because
> the encoded chars are wider).
>

I've run into this problem in Japanese as well. The result is that proper
eastern-language I18N ends up forced into byte[] instead of UTF8 strings
anyhow.

> I'm in favor of UTF8 strings, and also of "chunky" strings in which
> sub-runs are encoded using the most efficient encoding for the run.


Back at egroups/yahoo-groups, we used a UTF-8-compatible "chunk-marked"
encoding called ME8, written by Gaku Ueda. It allowed marking a chunk
with a specific charset encoding, which solves some of the issues Bennie
mentioned. I thought there was a more public draft written up about it,
but the best I could find is the note below; it explains how the sequence
was craftily chosen to be distinguishable from UTF-8 sequences.

http://dj1.willowmail.com/~jeske/_drop/ME8_chunked_charset_encoding.txt
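
To make the "chunky" idea concrete, here is a rough in-memory sketch in
Python. It is not the ME8 byte format (see the note above for that); the
(charset, bytes) tuple, the function names, and the candidate charset list
are all made up for illustration. Each run is simply stored in whichever
candidate charset encodes it in the fewest bytes.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical representation: one chunk is (charset name, encoded bytes).
# This is an in-memory illustration only, NOT the ME8 wire format.
Chunk = Tuple[str, bytes]


def encode_run(text: str,
               candidates: Iterable[str] = ("utf-8", "shift_jis", "gbk")) -> Chunk:
    """Encode one run with the candidate charset that yields the fewest bytes."""
    best: Optional[Chunk] = None
    for charset in candidates:
        try:
            data = text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the run at all
        # Strict "<" means the earliest candidate wins a tie.
        if best is None or len(data) < len(best[1]):
            best = (charset, data)
    if best is None:
        raise ValueError("no candidate charset can encode this run")
    return best


def decode_chunks(chunks: Iterable[Chunk]) -> Iterator[str]:
    """Decode chunks one at a time, in order; no random access is needed."""
    for charset, data in chunks:
        yield data.decode(charset)
```

With these defaults a plain ASCII run stays in UTF-8 (5 bytes for "hello"),
while a run like "日本語" drops from 9 bytes in UTF-8 to 6 in a legacy
double-byte charset.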

Regarding the lack of direct string[i] indexing, in all of the email/web
I18N stuff I've worked on, strings are nearly always stream-processed, so
there isn't any need for random access.
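
As a small illustration of what I mean by stream processing, here is a
sketch using Python's standard incremental UTF-8 decoder. The function name
and the way the input is split into chunks are my own choices for the
example; the point is only that code points come out in order even when a
multi-byte sequence straddles a chunk boundary, and nothing ever indexes
into the string.

```python
import codecs
from typing import Iterable, Iterator


def stream_code_points(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield code points one by one from UTF-8 bytes that arrive in pieces."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # The decoder buffers any incomplete trailing sequence internally
        # and finishes it when the next chunk arrives.
        for ch in decoder.decode(chunk):
            yield ch
    # final=True raises UnicodeDecodeError if the input ended mid-sequence.
    for ch in decoder.decode(b"", final=True):
        yield ch


if __name__ == "__main__":
    data = "日本語 text".encode("utf-8")
    # Split in the middle of multi-byte characters on purpose.
    pieces = [data[:2], data[2:7], data[7:]]
    print("".join(stream_code_points(pieces)))
```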
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
