On Thu, May 31, 2007 at 03:32:04PM -0700, Paul Schauble wrote:
> 
> 
> -----Original Message-----
> From: Samuel A. Falvo II [mailto:[EMAIL PROTECTED] 
> 
> Most Linux programs use UTF-8 for Unicode files, because they're
> (minimum) 50% smaller without compromising functionality.  AFAIK, they
> do not use a BOM for identifying UTF-8 files, because BOMs aren't
> required (and are indeed meaningless) for UTF-8 files.
> 
> -- 
> Samuel A. Falvo II
> 
> ------------------------
> 
> UTF-8 files are smaller only if the text is English. I'm currently
> working on a project in Simplified Chinese. Most Chinese characters are
> 3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that
> Java uses, some are 6.
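
To make those byte counts concrete, here is a quick sketch, assuming
the text and bytestring packages (purely illustrative):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- "中" (U+4E2D) takes 3 bytes in UTF-8 but only 2 in UTF-16.
    main :: IO ()
    main = do
      let zhong = T.singleton '\x4E2D'
      print (B.length (TE.encodeUtf8    zhong))   -- prints 3
      print (B.length (TE.encodeUtf16LE zhong))   -- prints 2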
> 
> Absent a BOM, is there another convention on Linux that allows you to
> identify a UTF-8 file? Or does the program just have to know in advance
> that it's reading UTF-8?
> 
> I ask because the file reading routine I use examines the file for a BOM
> and will interchangeably read ANSI text in the system default code page,
> UTF-8, UTF-16LE, and UTF-16BE. I am considering using a similar method
> for darcs to identify the type of a file. But this only works if the
> Linux and Unix conventions call for a BOM on Unicode files.
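
For reference, that sort of BOM sniffing is only a few lines; here is a
rough sketch assuming the bytestring package (the encoding type and the
fallback case are only illustrative):

    import qualified Data.ByteString as B

    data Encoding = Utf8 | Utf16LE | Utf16BE | SystemDefault
      deriving Show

    -- Inspect the first bytes of the file: EF BB BF marks UTF-8,
    -- FF FE marks UTF-16LE, FE FF marks UTF-16BE; anything else is
    -- treated as text in the system default code page.
    sniffBom :: B.ByteString -> (Encoding, B.ByteString)
    sniffBom bs
      | B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs = (Utf8,          B.drop 3 bs)
      | B.pack [0xFF, 0xFE]       `B.isPrefixOf` bs = (Utf16LE,       B.drop 2 bs)
      | B.pack [0xFE, 0xFF]       `B.isPrefixOf` bs = (Utf16BE,       B.drop 2 bs)
      | otherwise                                   = (SystemDefault, bs)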

On Linux, files are by convention encoded in the character set specified
by the locale environment variables; for instance, LANG=en_US.UTF-8 means
to use UTF-8.  A quick look at the system documentation suggests that
"locale charmap" prints the current encoding.

Linux distributions are moving toward making UTF-8 the default (and
hopefully, someday, the only option).

Stefan
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
