On Thu, May 31, 2007 at 03:52:50PM -0700, Samuel A. Falvo II wrote:
> On 5/31/07, Paul Schauble <[EMAIL PROTECTED]> wrote:
> >working on a project in Simplified Chinese. Most Chinese characters are
> >3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that
> >Java uses, some are 6.
> 
> True, but most of the languages on this planet use the Latin character
> set, _most_ of which seems to be covered by plain ASCII (codes 32-126).
> Hence, a larger group of languages gets a more compact
> representation.
> 
> Also, "most" programs use UTF-8.  Some don't  --  IIRC, the "sam"
> editor uses raw UTF-16 for its output files, while GCC compiles code
> such that Unicode characters are 32 bits wide.
>
> >Absent a BOM, is there another convention on Linux that allows you to
> >identify a UTF-8 file? Or does the program just have to know in advance
> >that it's reading UTF-8?
> 
> At the end of the day, you just have to know.  However, most
> UTF-encoded files will have ".utf" extension instead of ".txt."

Hmm, I've never seen that one.
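
For what it's worth, absent a BOM the usual fallback is a heuristic: try
a strict UTF-8 decode and see whether it succeeds. Text in a legacy
encoding that contains non-ASCII bytes rarely happens to be valid UTF-8,
so a clean decode is a reasonable hint, though not proof. Below is a
minimal sketch in Python; the function names are made up for
illustration, and it assumes the whole file fits in memory.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(UTF8_BOM)


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: return True if the data is well-formed UTF-8.

    Pure ASCII also passes, since ASCII is a subset of UTF-8, so a True
    result does not prove that the author intended UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "utf-8 Chinese": "中文".encode("utf-8"),
        "gb2312 Chinese": "中文".encode("gb2312"),
        "plain ASCII": b"hello",
    }
    for name, raw in samples.items():
        print(name, has_utf8_bom(raw), looks_like_utf8(raw))
```

The same check shows the byte counts mentioned above: in Python,
len("中".encode("utf-8")) is 3, while a character outside the Basic
Multilingual Plane, such as U+20000, encodes to 4 bytes.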

Stefan
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
