On 5/31/07, Paul Schauble <[EMAIL PROTECTED]> wrote:
> working on a project in Simplified Chinese. Most Chinese characters are
> 3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that
> Java uses, some are 6.

True, but most of the world's languages use the Latin alphabet, most of
which is covered by plain ASCII (codes 32-126), where UTF-8 needs only
one byte per character.  Hence, a larger group of languages gets the
more compact representation.

Also, "most" programs use UTF-8, but some don't  --  IIRC, the "sam"
editor uses raw UTF-16 for its output files, while GCC compiles wide
string literals so that each character is 32 bits.
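To make the size trade-off concrete, here is a small sketch (Python used for illustration; "汉字" stands in for the Simplified Chinese text mentioned above) comparing the byte counts of the same text under the common Unicode encodings:

```python
# Byte counts for the same text under different Unicode encodings.
# "abc" is plain ASCII; "汉字" is two common Simplified Chinese characters.
for text in ("abc", "汉字"):
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(text, enc, len(text.encode(enc)))
```

ASCII text is 1 byte per character in UTF-8 but 4 in UTF-32, while the Chinese characters take 3 bytes each in UTF-8 versus 2 each in UTF-16 -- which is why the "more compact" encoding depends on the language being written.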

> Absent a BOM, is there another convention on Linux that allows you to
> identify a UTF-8 file? Or does the program just have to know in advance
> that it's reading UTF-8?

At the end of the day, you just have to know.  However, most
UTF-encoded files will have a ".utf" extension instead of ".txt".
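That said, a common heuristic (it's roughly what tools like file(1) do) is to simply try decoding the bytes as UTF-8: the encoding's structure means random 8-bit text almost never validates, so a successful decode is strong evidence. A minimal sketch of that check:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic UTF-8 detection: valid UTF-8 byte sequences are so
    structured that non-UTF-8 8-bit text rarely decodes cleanly."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("汉字".encode("utf-8")))  # well-formed UTF-8
print(looks_like_utf8(b"\xff\xfehello"))        # 0xFF never occurs in UTF-8
```

Note the heuristic can't distinguish pure ASCII from UTF-8 (ASCII is valid UTF-8 by design), but for that case the distinction doesn't matter.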

--
Samuel A. Falvo II
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel