On Sun, Feb 2, 2014 at 10:16 PM, Karl Berry <[email protected]> wrote:
> > * Default encoding is set as UTF-8 - decide whether this is desired
> All I can think of to base the default on the current locale, because
> that's the only information we've got about what the user desires.
> E.g., if the locale is "C" (or, equivalently, "POSIX", of course), the
> target should be plain 7-bit ASCII. If the locale is *.UTF-8, then the
> target should be UTF-8. Etc. (I don't know all the locale names used
> in this context, and can't find anything that seems like a comprehensive
> list, although it must be out there somewhere.)

To be clear, it is the default file encoding that is set to UTF-8, not the output encoding; the output encoding is set from the locale. I would think that we should leave files as they are if we don't know their encoding. That way we don't risk breaking something that works already.

On the subject of interpreting ISO-8859 text as UTF-8 and passing through any unrecognized byte sequences, I think Per Bothner is right that this could fail. The risk is reduced because ISO-8859 has no printable characters in the range 80 to 9f, so a byte sequence like 110xxxxx 10yyyyyy in ISO-8859 text could only be incorrectly interpreted as UTF-8 if the second byte was in the range a0 to bf. That is, there are 32 characters we could lose, and they might not be used much anyway in existing info files. I'd like some better evidence it wouldn't be a problem, though.
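As a rough check of the byte ranges above, here is a minimal Python sketch (not taken from the Texinfo sources; the function name is made up for illustration). It assumes ISO-8859 text contains no bytes in 80 to 9f, and lists every two-byte pair in a0 to ff that a strict UTF-8 decoder would accept as a single character.

```python
def misreadable_pairs():
    """List ISO-8859 byte pairs that are also a well-formed UTF-8 sequence.

    Only bytes a0..ff are tried, on the assumption that 80..9f do not
    occur in ISO-8859 text (hypothetical helper, for illustration only).
    """
    pairs = []
    for first in range(0xA0, 0x100):
        for second in range(0xA0, 0x100):
            try:
                bytes([first, second]).decode("utf-8")
            except UnicodeDecodeError:
                continue  # not valid UTF-8, so it would be passed through
            pairs.append((first, second))
    return pairs


pairs = misreadable_pairs()
first_bytes = sorted({first for first, _ in pairs})
second_bytes = sorted({second for _, second in pairs})
```

Here `second_bytes` comes out as the 32 values a0 to bf, matching the count above, and `first_bytes` as c2 to df (c0 and c1 are never valid UTF-8 lead bytes). One concrete case: the bytes c3 a9 are the two characters "Ã©" in ISO-8859-1 but the single character "é" when read as UTF-8. This only shows which pairs are possible, not how often they occur in existing info files.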
