On 5/31/07, Paul Schauble <[EMAIL PROTECTED]> wrote:
> working on a project in Simplified Chinese. Most Chinese characters are 3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that Java uses, some are 6.
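The byte counts above can be checked directly. The sketch below (my own illustration, not from the thread) shows a BMP Chinese character at 3 bytes, a supplementary-plane character at 4 bytes, and how Java's modified UTF-8 reaches 6 bytes by encoding each half of a UTF-16 surrogate pair as its own 3-byte sequence:

```python
# Standard UTF-8: a common Chinese character (BMP) takes 3 bytes.
ch = "\u4e2d"  # 中, U+4E2D
print(len(ch.encode("utf-8")))  # 3

# A supplementary-plane CJK ideograph takes 4 bytes in standard UTF-8.
supp = "\U00020000"  # U+20000
print(len(supp.encode("utf-8")))  # 4

# Java's "modified UTF-8" instead encodes supplementary characters as a
# UTF-16 surrogate pair, each surrogate as a 3-byte sequence: 6 bytes total.
hi, lo = divmod(ord(supp) - 0x10000, 0x400)
pair = chr(0xD800 + hi) + chr(0xDC00 + lo)
modified_len = sum(len(c.encode("utf-8", "surrogatepass")) for c in pair)
print(modified_len)  # 6
```

(The `surrogatepass` error handler is just a way to make Python emit the CESU-8-style byte sequences that Java's `DataOutput.writeUTF` produces; it is not something the original mail mentions.)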
True, but most of the languages on this planet use the Latin character set, _most_ of which seems to be covered by plain ASCII (codes 32-126). Hence, UTF-8 gives a more compact representation to a larger group of languages. Also, "most" programs use UTF-8. Some don't -- IIRC, the "sam" editor uses raw UTF-16 for its output files, while GCC compiles wide characters (wchar_t) as 32 bits.
> Absent a BOM, is there another convention on Linux that allows you to identify a UTF-8 file? Or does the program just have to know in advance that it's reading UTF-8?
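One heuristic worth noting here (my own addition, not from the thread): because UTF-8's multibyte sequences follow strict bit patterns, arbitrary non-UTF-8 data rarely decodes cleanly, so a strict decode attempt serves as a reasonable guess:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess whether a byte string is UTF-8 by attempting a strict decode.

    This is only a heuristic: pure ASCII always passes, and a Latin-1 file
    could pass by coincidence, but random binary data almost never does.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("\u4e2d\u6587".encode("utf-8")))  # True
print(looks_like_utf8(b"\xff\xfe\x00t"))                # False (UTF-16-LE bytes)
```

Tools like `file(1)` apply essentially this kind of validation when reporting "UTF-8 Unicode text".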
At the end of the day, you just have to know. However, many UTF-encoded files will have a ".utf" extension instead of ".txt".

-- Samuel A. Falvo II

_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
