-----Original Message-----
From: Samuel A. Falvo II [mailto:[EMAIL PROTECTED]]
Most Linux programs use UTF-8 for Unicode files, because they're
(at minimum) 50% smaller than UTF-16 without compromising
functionality. AFAIK, they do not use a BOM to identify UTF-8 files,
because a BOM isn't required for UTF-8 (and indeed is meaningless,
since UTF-8 has no byte order to mark).
--
Samuel A. Falvo II
------------------------
UTF-8 files are only smaller than UTF-16 when the text is mostly
ASCII. I'm currently working on a project in Simplified Chinese.
Most Chinese characters are 3 bytes in UTF-8; some (those outside
the Basic Multilingual Plane) are 4. In the so-called "modified
UTF-8" that Java uses, those same characters take 6.
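For concreteness, here's a quick check of those byte counts using the
text and bytestring packages (just a sketch; U+4E2D and U+20000 are
example characters I picked, nothing special about them):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      let bmp = T.pack "\x4E2D"   -- U+4E2D, a BMP ideograph
          ext = T.pack "\x20000"  -- a supplementary-plane ideograph
      print (B.length (TE.encodeUtf8    bmp))  -- 3 bytes in UTF-8
      print (B.length (TE.encodeUtf16LE bmp))  -- 2 bytes in UTF-16
      print (B.length (TE.encodeUtf8    ext))  -- 4 bytes in UTF-8
      print (B.length (TE.encodeUtf16LE ext))  -- 4 bytes (surrogate pair)

(Java's modified UTF-8 encodes each half of the surrogate pair
separately, hence 3 + 3 = 6 bytes for the supplementary character.)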
Absent a BOM, is there another convention on Linux that allows you to
identify a UTF-8 file? Or does the program just have to know in advance
that it's reading UTF-8?
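If the answer is "validate and guess", I imagine it would look roughly
like this (Haskell sketch, untested; decodeUtf8' is from the text
package):

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    -- Heuristic only: pure ASCII passes trivially, and some
    -- legacy-encoded files also happen to be valid UTF-8.
    looksLikeUtf8 :: B.ByteString -> Bool
    looksLikeUtf8 bs = case TE.decodeUtf8' bs of
      Right _ -> True
      Left  _ -> False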
I ask because the file-reading routine I use examines the file for a
BOM and will interchangeably read ANSI text in the system default
code page, UTF-8, UTF-16LE, and UTF-16BE. I am considering using a
similar method for darcs to identify the type of a file. But this
only works if Linux and Unix conventions call for a BOM on Unicode
files.
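In outline, the sniffing my routine does is just this (a Haskell
sketch of the same idea; the type and names are mine, not existing
darcs code):

    import qualified Data.ByteString as B

    data Encoding = Utf8 | Utf16LE | Utf16BE | SystemCodePage
      deriving Show

    -- Check the first bytes for a BOM and strip it; with no BOM,
    -- fall back to the system default code page.
    sniffBom :: B.ByteString -> (Encoding, B.ByteString)
    sniffBom bs
      | B.pack [0xEF,0xBB,0xBF] `B.isPrefixOf` bs = (Utf8,    B.drop 3 bs)
      | B.pack [0xFF,0xFE]      `B.isPrefixOf` bs = (Utf16LE, B.drop 2 bs)
      | B.pack [0xFE,0xFF]      `B.isPrefixOf` bs = (Utf16BE, B.drop 2 bs)
      | otherwise                                 = (SystemCodePage, bs)

The weak point is exactly the last guard: on Linux, where Unicode
files typically carry no BOM, everything falls into the
SystemCodePage case.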
++PLS
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel