Stefan O'Rear writes:

 > On Thu, May 31, 2007 at 03:32:04PM -0700, Paul Schauble wrote:

 > > Absent a BOM, is there another convention on Linux that allows you to
 > > identify a UTF-8 file? Or does the program just have to know in advance
 > > that it's reading UTF-8?

If *all* 8-bit characters (bytes with the high bit set) come in groups,
with the leading byte of the form 11bbbbbb and the later ones of the
form 10bbbbbb, you're probably looking at UTF-8.  (You can actually be
more precise: count the number of leading 1s in the first byte, say N,
and it will be followed by exactly N-1 10bbbbbb-form bytes.  The next
byte will be either 0bbbbbbb or 11bbbbbb.)
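
A minimal sketch of that byte-pattern check in Python, for illustration
only.  The function name looks_like_utf8 is hypothetical, and the limit
of four bytes per group is an assumption taken from the current UTF-8
definition rather than from anything stated above.  It checks structure
only: it does not reject overlong forms or surrogates, and pure ASCII
trivially passes.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if every high-bit byte sits in a well-formed group."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # 0bbbbbbb: plain ASCII, a group of one
            i += 1
            continue
        # Count the leading 1 bits of the first byte (this is N).
        lead = 0
        mask = 0x80
        while mask and (b & mask):
            lead += 1
            mask >>= 1
        # N == 1 is a stray 10bbbbbb byte; N > 4 is assumed invalid.
        if lead < 2 or lead > 4:
            return False
        # The group must not run off the end of the data.
        if i + lead > n:
            return False
        # Exactly N-1 bytes of the form 10bbbbbb must follow.
        for j in range(1, lead):
            if data[i + j] & 0xC0 != 0x80:
                return False
        i += lead
    return True
```

For real use, a strict decoder (for example data.decode("utf-8") and
catching UnicodeDecodeError) is probably the safer test; the sketch only
mirrors the rule of thumb described here.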

 > On Linux, all files are encoded in the character set specified by the
 > locale environment variables; for instance LANG=en_US.UTF8 means to use
 > utf8.  Quick perusal of system documentation seems to show that "locale
 > charmap" prints the current encoding.

This is highly unreliable for many users.  In fact it's very likely
that mbox files, saved HTML pages, and the like will be in various
encodings.

 > Linux distributions are moving toward making UTF8 the default (and
 > hopefully someday, the only option).

That will not happen any time soon, because there exist read-only
media in legacy encodings.

_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
