On May 31, 2007, at 3:32 PM, Paul Schauble wrote:
UTF-8 files are only smaller if the text is English only.

(Nit: Other European languages also do well in UTF8 since they usually only have scattered non-ASCII characters. I agree with your point re Chinese, though!)
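
(To put rough numbers on that: ASCII is 1 byte per character in UTF-8 vs 2 in UTF-16, while most Chinese characters take 3 bytes in UTF-8 vs 2 in UTF-16. A quick sketch using the text and bytestring packages, with made-up sample strings:)

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  -- ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
  print (B.length (TE.encodeUtf8    (T.pack "hello")))  -- 5
  print (B.length (TE.encodeUtf16LE (T.pack "hello")))  -- 10
  -- Chinese text: typically 3 bytes per character in UTF-8, 2 in UTF-16.
  print (B.length (TE.encodeUtf8    (T.pack "中文")))    -- 6
  print (B.length (TE.encodeUtf16LE (T.pack "中文")))    -- 4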

Absent a BOM, is there another convention on Linux that allows you to
identify a UTF-8 file? Or does the program just have to know in advance
that it's reading UTF-8?

The latter. The use of the ZWNBSP (U+FEFF) as a magic number for UTF-8 files is a Microsoftism, as far as I know. I don't think I've seen it on other platforms.

I ask because the file reading routine I use examines the file for a BOM
and will interchangeably read ANSI in the system default code page,
UTF-8, UTF-16LE, and UTF-16BE. I am considering using a similar method
for darcs to identify the type of a file. But this only works if the
Linux and Unix conventions call for a BOM on Unicode files.
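
(Just to make that concrete, BOM sniffing along those lines could look roughly like the Haskell sketch below; the names and types are mine, not the actual routine:)

import qualified Data.ByteString as B

data Encoding = SystemDefault | Utf8 | Utf16LE | Utf16BE
  deriving (Show, Eq)

-- Check the first bytes of the file for a byte-order mark and fall back
-- to the system default code page when none is present.
detectEncoding :: B.ByteString -> Encoding
detectEncoding bs
  | B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs = Utf8
  | B.pack [0xFF, 0xFE]       `B.isPrefixOf` bs = Utf16LE
  | B.pack [0xFE, 0xFF]       `B.isPrefixOf` bs = Utf16BE
  | otherwise                                   = SystemDefault

You'd run that over the first few bytes (e.g. from B.readFile or hGet) and then pick the matching decoder.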

The question of what kind of character encoding a file uses is kind of like the question of what language the contents are written in (Spanish, C++, etc.). One reason that UTF8 is popular is that many/most utilities can remain agnostic about the character encoding of the text they're handling, without breaking anything.

In practice, if you know that something is either utf8 or utf16, it's easy to distinguish, but I dislike programs that run more complicated guessing algorithms over the text. Every now and then they get it wrong...
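
(And for the easy two-way case, even something crude works, since ASCII-heavy UTF-16 text is full of NUL bytes and valid UTF-8 text almost never contains any. A sketch, assuming the input really is one of the two:)

import qualified Data.ByteString as B

-- Crude discriminator for data known to be either UTF-8 or UTF-16:
-- mostly-ASCII UTF-16 text contains plenty of 0x00 bytes, while a NUL
-- byte in UTF-8 can only come from an actual U+0000 in the text.
looksLikeUtf16 :: B.ByteString -> Bool
looksLikeUtf16 = B.elem 0x00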


