On Thu, May 31, 2007 at 03:52:50PM -0700, Samuel A. Falvo II wrote:
> On 5/31/07, Paul Schauble <[EMAIL PROTECTED]> wrote:
> >working on a project in Simplified Chinese. Most Chinese characters are
> >3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that
> >Java uses, some are 6.
>
> True, but most of the languages on this planet use the Latin character
> set, which _most_ seem to be covered by plain ANSI (codes 32-126).
> Hence, a more compact representation is had by a larger group of
> languages.
>
> Also, "most" programs use UTF-8.  Some don't -- IIRC, the "sam"
> editor uses raw UTF-16 for its output files, while GCC compiles code
> such that Unicode characters are 32-bits wide.
>
> >Absent a BOM, is there another convention on Linux that allows you to
> >identify a UTF-8 file? Or does the program just have to know in advance
> >that it's reading UTF-8?
>
> At the end of the day, you just have to know.  However, most
> UTF-encoded files will have ".utf" extension instead of ".txt."

Hmm, I've never seen that one.

Stefan
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
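
On the BOM question quoted above: a minimal sketch, in Python, of the usual fallback when no BOM is present, which is to attempt a strict UTF-8 decode and treat failure as "not UTF-8". The name looks_like_utf8 is made up for illustration and is not taken from darcs or any library. This is only a heuristic: pure ASCII passes too, and a short input in a legacy 8-bit encoding can occasionally happen to form valid UTF-8 sequences.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: True if the whole buffer decodes as strict UTF-8.

    Hypothetical helper for illustration. A buffer that starts with a
    UTF-8 BOM also passes, since the BOM is itself valid UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Byte lengths mentioned in the thread: a common CJK character takes
# 3 bytes in UTF-8, one outside the Basic Multilingual Plane takes 4.
common_cjk_len = len("\u4e2d".encode("utf-8"))     # U+4E2D
supplementary_len = len("\U00020000".encode("utf-8"))  # U+20000
```

The check needs the complete file contents (or at least a buffer that does not end in the middle of a multi-byte character); otherwise a truncated trailing sequence makes valid UTF-8 look invalid.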
