Gurusamy Sarathy writes:
: Treating literals as utf8 is a bit of a compatibility issue, but
: I think we should get around that by treating the lex input stream
: as any other discipline. IOW, default PL_rsfp to byte mode,
: and let users push a utf8/utf16/whatever discipline on it if they
: wanna. (This would apply to identifiers as well.)
They can always push a discipline explicitly, but I think we should just
auto-recognize it by default. Start with a generic discipline that
doesn't commit, then when you see a high bit, look and see if you've
got illegal utf8. If you do, it's not utf8. If you don't, it is
99.99% certain (in Latin-1 countries) to be utf8, and you can look ahead
some more if you want to be more certain. In non-Latin-1 countries
you probably have to push an explicit discipline anyway to tell it which
of many character sets you're using if you're not using utf8, so it
should still default to utf8 if it sees legal utf8.
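
The recognition step described above can be sketched roughly as follows. This is only an illustration, not perl's actual lexer code: the names sniff_encoding, utf8_seq_len and encoding_guess are made up for the example, "legal utf8" is assumed to mean the modern well-formed definition (no overlongs, no surrogates, nothing above U+10FFFF), and a sequence truncated at the end of the buffer is simply treated as illegal, where a real stream would wait for more input.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not perl's actual lexer code. */
typedef enum { ENC_UNCOMMITTED, ENC_UTF8, ENC_BYTES } encoding_guess;

/* Length of the well-formed UTF-8 sequence starting at p, or 0 if the
 * bytes there are ill-formed or truncated. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the 2nd byte */
    size_t len, i;

    if (avail == 0) return 0;
    if (p[0] < 0x80) return 1;           /* plain ASCII */
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) len = 2;
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) {
        len = 3;
        if (p[0] == 0xE0) lo = 0xA0;     /* reject overlong forms */
        if (p[0] == 0xED) hi = 0x9F;     /* reject surrogates */
    }
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) {
        len = 4;
        if (p[0] == 0xF0) lo = 0x90;     /* reject overlong forms */
        if (p[0] == 0xF4) hi = 0x8F;     /* reject anything above U+10FFFF */
    }
    else return 0;  /* stray continuation byte, 0xC0, 0xC1, 0xF5..0xFF */

    if (avail < len) return 0;
    if (p[1] < lo || p[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (p[i] < 0x80 || p[i] > 0xBF) return 0;
    return len;
}

/* Stay uncommitted while the input is pure ASCII. Once a high bit shows
 * up, the input counts as utf8 only if every sequence in the buffer is
 * legal; a single illegal sequence means it is 8-bit data. */
encoding_guess sniff_encoding(const unsigned char *buf, size_t len)
{
    encoding_guess guess = ENC_UNCOMMITTED;
    size_t i = 0;

    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0) return ENC_BYTES;
        if (n > 1) guess = ENC_UTF8;
        i += n;
    }
    return guess;
}
```

Scanning the whole buffer is the "look ahead some more" case; a cheaper variant could stop after the first few legal multibyte sequences.
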
If you'd like to think of it in stronger terms, the script is in utf8
until proven otherwise. Just start with the utf8 discipline by default
and make it smart enough to recover from errors by switching to an
8-bit/binary discipline if we haven't already committed too much to utf8.
This will also be a useful discipline for files of unknown provenance,
I think, but we probably wouldn't make it the default discipline for
ordinary files. It might possibly be the default utf8 discipline,
though there might be some call for a utf8_darnit discipline that would
puke on non-utf8 rather than trying to switch.
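
To make the difference between the forgiving discipline and a utf8_darnit one concrete, here is a minimal sketch. Everything in it is assumed for the sake of illustration: read_chars, decode_one, the READ_* codes and the strict flag are hypothetical names, and the whole input is taken to be in one buffer so that nothing has been committed to the caller before the fallback decision is made.

```c
#include <stddef.h>

/* Hypothetical return codes for this sketch. */
enum { READ_UTF8 = 0, READ_BYTES = 1, READ_ERROR = -1 };

/* Decode one UTF-8 sequence at p into *cp. Returns the number of bytes
 * consumed, or 0 if the sequence is ill-formed or truncated. */
static size_t decode_one(const unsigned char *p, size_t avail,
                         unsigned long *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    unsigned long c;

    if (avail == 0) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
    else return 0;                       /* stray continuation or bad lead byte */

    if (avail < len) return 0;
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Fill out[] with one value per character; out must have room for len
 * entries. With strict == 0, illegal utf8 makes the reader start over
 * and treat every byte as one 8-bit character. With strict != 0 (the
 * utf8_darnit behaviour), illegal utf8 is an error. */
int read_chars(const unsigned char *buf, size_t len, int strict,
               unsigned long *out, size_t *nout)
{
    size_t i = 0, n = 0, k;

    while (i < len) {
        size_t used = decode_one(buf + i, len - i, &out[n]);
        if (used == 0) {
            if (strict) return READ_ERROR;
            /* Nothing has been handed to the caller yet, so it is still
             * safe to reinterpret the whole buffer as 8-bit data. */
            for (k = 0; k < len; k++) out[k] = buf[k];
            *nout = len;
            return READ_BYTES;
        }
        i += used;
        n++;
    }
    *nout = n;
    return READ_UTF8;
}
```

A streaming version would have to track how much it has already returned as utf8 before it hits the bad sequence, which is where the "haven't already committed too much" condition comes in.
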