Stefan O'Rear writes:
> On Thu, May 31, 2007 at 03:32:04PM -0700, Paul Schauble wrote:
> > Absent a BOM, is there another convention on Linux that allows you to
> > identify a UTF-8 file? Or does the program just have to know in advance
> > that it's reading UTF-8?

If *all* 8-bit characters come in groups, with the leading byte of the
form 11bbbbbb and the later ones of the form 10bbbbbb, you're probably
looking at UTF-8. (You can actually be more precise: count the number of
leading 1s in the first byte, say N, and it will be followed by exactly
N-1 bytes of the form 10bbbbbb. The next byte will be either 0bbbbbbb or
11bbbbbb.)

> On Linux, all files are encoded in the character set specified by the
> locale environment variables; for instance LANG=en_US.UTF8 means to use
> utf8. Quick perusal of system documentation seems to show that "locale
> charmap" prints the current encoding.

Relying on the locale is highly unreliable for many users. In fact it's
very likely that mbox files, saved HTML pages, and the like will be in
various encodings.

> Linux distributions are moving toward making UTF8 the default (and
> hopefully someday, the only option).

That will not happen any time soon, because there exist read-only media
in legacy encodings.

_______________________________________________
darcs-devel mailing list
http://lists.osuosl.org/mailman/listinfo/darcs-devel
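For anyone who wants to try the byte-pattern check described above, here
is a minimal sketch in Python. The function name looks_like_utf8 is made
up for illustration, and the cap of N <= 4 is an assumption taken from
RFC 3629 rather than from the description in the message. It only checks
the lead-byte/continuation-byte structure, so treat a True result as
"probably UTF-8", not as proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: every byte >= 0x80 must be part of a group made of
    a lead byte with N leading 1 bits followed by exactly N-1 bytes of the
    form 10bbbbbb. Pure ASCII input passes trivially."""
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # 0bbbbbbb: plain ASCII, nothing to verify.
            i += 1
            continue

        # Count the leading 1 bits of the lead byte (this is N).
        ones = 0
        mask = 0x80
        while mask and (byte & mask):
            ones += 1
            mask >>= 1

        # N == 1 is a stray continuation byte; N > 4 is rejected here
        # (assumption: RFC 3629 limits sequences to 4 bytes).
        if ones < 2 or ones > 4:
            return False

        # The lead byte must be followed by exactly N-1 continuation bytes.
        tail = data[i + 1 : i + ones]
        if len(tail) != ones - 1:
            return False  # input ends in the middle of a group
        if any((c & 0xC0) != 0x80 for c in tail):
            return False  # a following byte is not of the form 10bbbbbb

        i += ones
    return True
```

This does not reject overlong encodings or surrogate code points. If a
strict answer is needed, calling data.decode("utf-8") and catching
UnicodeDecodeError is the more thorough test.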
