-----Original Message-----
From: Samuel A. Falvo II [mailto:[EMAIL PROTECTED]]

Most Linux programs use UTF-8 for Unicode files, because for
mostly-ASCII text it is at least 50% smaller than UTF-16 without
compromising functionality.  AFAIK, they do not use a BOM to identify
UTF-8 files, because a BOM isn't required for UTF-8 (and is indeed
meaningless, since UTF-8 has no byte-order ambiguity).

-- 
Samuel A. Falvo II

------------------------

UTF-8 files are only smaller if the text is mostly ASCII. I'm currently
working on a project in Simplified Chinese. Most Chinese characters are
3 bytes in UTF-8, and some (those outside the Basic Multilingual Plane)
are 4. In the so-called "modified UTF-8" that Java uses, those
supplementary characters take 6 bytes, because each half of the UTF-16
surrogate pair is encoded separately.
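
To make the arithmetic concrete, here is a minimal GHC sketch (using
the standard text and bytestring packages) that prints the UTF-8 bytes
of a common BMP character:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- U+4E2D is in the Basic Multilingual Plane, so it takes three
    -- bytes in UTF-8 (it would take two in UTF-16).
    main :: IO ()
    main = do
      let bytes = TE.encodeUtf8 (T.pack "\x4E2D")
      print (B.unpack bytes)  -- [228,184,173], i.e. 0xE4 0xB8 0xAD
      print (B.length bytes)  -- 3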

Absent a BOM, is there another convention on Linux that allows you to
identify a UTF-8 file? Or does the program just have to know in advance
that it's reading UTF-8?
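
(One heuristic I can imagine, rather than a formal convention, is to
sniff the content itself: well-formed UTF-8 multi-byte sequences have a
rigid bit pattern, so text in a legacy code page rarely validates by
accident. A rough sketch of such a check in Haskell; the function name
is mine, and it skips the overlong/surrogate corner cases:)

    import qualified Data.ByteString as B
    import Data.Word (Word8)

    looksLikeUtf8 :: B.ByteString -> Bool
    looksLikeUtf8 = go . B.unpack
      where
        go :: [Word8] -> Bool
        go [] = True
        go (b:bs)
          | b < 0x80              = go bs           -- plain ASCII
          | b >= 0xC2 && b < 0xE0 = continues 1 bs  -- 2-byte sequence
          | b >= 0xE0 && b < 0xF0 = continues 2 bs  -- 3-byte sequence
          | b >= 0xF0 && b < 0xF5 = continues 3 bs  -- 4-byte sequence
          | otherwise             = False  -- stray continuation, 0xC0/0xC1, >0xF4
          where
            -- each continuation byte must be in 0x80-0xBF
            continues :: Int -> [Word8] -> Bool
            continues 0 rest = go rest
            continues n (c:rest)
              | c >= 0x80 && c < 0xC0 = continues (n - 1) rest
            continues _ _ = False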

I ask because the file-reading routine I use examines the file for a
BOM and will interchangeably read ANSI text in the system default code
page, UTF-8, UTF-16LE, and UTF-16BE. I am considering using a similar
method for darcs to identify the type of a file, but this only works if
Linux and Unix conventions call for a BOM on Unicode files.
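
In Haskell terms (darcs being Haskell), the check I have in mind
amounts to something like this sketch; the type and constructor names
are made up, and "no BOM" falls back to the system code page exactly as
my current routine does:

    import qualified Data.ByteString as B

    data Sniffed = Utf8 | Utf16LE | Utf16BE | SystemCodePage
      deriving (Show, Eq)

    -- Look for a BOM at the start of the raw bytes and strip it,
    -- falling back to the system default code page when absent.
    sniffBom :: B.ByteString -> (Sniffed, B.ByteString)
    sniffBom bs
      | B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs = (Utf8,    B.drop 3 bs)
      | B.pack [0xFF, 0xFE]       `B.isPrefixOf` bs = (Utf16LE, B.drop 2 bs)
      | B.pack [0xFE, 0xFF]       `B.isPrefixOf` bs = (Utf16BE, B.drop 2 bs)
      | otherwise                                   = (SystemCodePage, bs)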

    ++PLS
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
