On May 31, 2007, at 3:32 PM, Paul Schauble wrote:
UTF-8 files are only smaller if the text is English only.

(Nit: Other European languages also do well in UTF8 since they usually only have scattered non-ASCII characters. I agree with your point re Chinese, though!)
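
(To put rough numbers on that: ASCII is 1 byte per character in UTF-8 vs 2 in UTF-16, while most Chinese characters take 3 bytes in UTF-8 vs 2 in UTF-16. A quick sketch using the text and bytestring packages, with made-up sample strings:)

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  -- ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
  print (B.length (TE.encodeUtf8    (T.pack "hello")))  -- 5
  print (B.length (TE.encodeUtf16LE (T.pack "hello")))  -- 10
  -- Chinese text: typically 3 bytes per character in UTF-8, 2 in UTF-16.
  print (B.length (TE.encodeUtf8    (T.pack "中文")))    -- 6
  print (B.length (TE.encodeUtf16LE (T.pack "中文")))    -- 4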

Absent a BOM, is there another convention on Linux that allows you to
identify a UTF-8 file? Or does the program just have to know in advance
that it's reading UTF-8?

The latter. The use of the ZWNBSP (U+FEFF) as a magic number for UTF-8 files is a Microsoftism, as far as I know. I don't think I've seen it on other platforms.

I ask because the file reading routine I use examines the file for a BOM
and will interchangeably read ANSI in the system default code page,
UTF-8, UTF-16LE, and UTF-16BE. I am considering using a similar method
for darcs to identify the type of a file. But this only works if the
Linux and Unix conventions call for a BOM on Unicode files.
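
(Just to make that concrete, BOM sniffing along those lines could look roughly like the Haskell sketch below; the names and types are mine, not the actual routine:)

import qualified Data.ByteString as B

data Encoding = SystemDefault | Utf8 | Utf16LE | Utf16BE
  deriving (Show, Eq)

-- Check the first bytes of the file for a byte-order mark and fall back
-- to the system default code page when none is present.
detectEncoding :: B.ByteString -> Encoding
detectEncoding bs
  | B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs = Utf8
  | B.pack [0xFF, 0xFE]       `B.isPrefixOf` bs = Utf16LE
  | B.pack [0xFE, 0xFF]       `B.isPrefixOf` bs = Utf16BE
  | otherwise                                   = SystemDefault

You'd run that over the first few bytes (e.g. from B.readFile or hGet) and then pick the matching decoder.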

The question of what kind of character encoding a file uses is kind of like the question of what language the contents are written in (Spanish, C++, etc.). One reason that UTF8 is popular is that many/most utilities can remain agnostic about the character encoding of the text they're handling, without breaking anything.

In practice, if you know that something is either utf8 or utf16, it's easy to distinguish, but I dislike programs that run more complicated guessing algorithms over the text. Every now and then they get it wrong...
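
(And for the easy two-way case, even something crude works, since ASCII-heavy UTF-16 text is full of NUL bytes and valid UTF-8 text almost never contains any. A sketch, assuming the input really is one of the two:)

import qualified Data.ByteString as B

-- Crude discriminator for data known to be either UTF-8 or UTF-16:
-- mostly-ASCII UTF-16 text contains plenty of 0x00 bytes, while a NUL
-- byte in UTF-8 can only come from an actual U+0000 in the text.
looksLikeUtf16 :: B.ByteString -> Bool
looksLikeUtf16 = B.elem 0x00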


