There is some sense in this. The same sort of slovenly implementation
which might treat 0xC0 0xAF (non-minimal encoding) as '/' sometimes but
not always, might well also treat 0xFD 0x80 0x80 0x80 0x80 0xAF (code
point far outside the Unicode range) as '/' sometimes but not always.
If you think it is best to restrict the spec to fix the first problem (as
opposed to, say, shooting the incompetent programmer), restricting it
further to fix the second is also reasonable.
These are two separate issues, and they should not be tied together. Overlong (non-minimal) sequences are clearly invalid; "out-of-range" sequences are a different matter, as I argue below.
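Purely as an illustration (this sketch is mine, not taken from any existing implementation), the two failure modes look like this: a decoder that only strips the tag bits and concatenates the payload turns the overlong C0 AF into U+002F ('/') and turns the old 6-byte form FD 80 80 80 80 AF into 0x4000002F, far outside the Unicode range, while a strict decoder in the spirit of RFC 3629 rejects both.

#include <stdio.h>

/* Decode one UTF-8 sequence of the given length with no minimality or
 * range checks: just strip the tag bits and concatenate the payload. */
static long decode_sloppy(const unsigned char *s, int len)
{
    static const unsigned char tagmask[] = { 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
    long cp = s[0] & tagmask[len - 1];
    int i;
    for (i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Same payload extraction, but reject overlong forms, surrogates and
 * anything beyond U+10FFFF, as RFC 3629 requires.  Returns -1 on error. */
static long decode_strict(const unsigned char *s, int len)
{
    static const long minval[] = { 0x0, 0x80, 0x800, 0x10000 };
    long cp;
    if (len < 1 || len > 4)
        return -1;              /* 5- and 6-byte forms are no longer legal */
    cp = decode_sloppy(s, len);
    if (cp < minval[len - 1])
        return -1;              /* overlong (non-minimal) encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;              /* out of range or a surrogate */
    return cp;
}

int main(void)
{
    const unsigned char overlong[]   = { 0xC0, 0xAF };
    const unsigned char outofrange[] = { 0xFD, 0x80, 0x80, 0x80, 0x80, 0xAF };

    printf("sloppy: %lX %lX\n", (unsigned long)decode_sloppy(overlong, 2),
           (unsigned long)decode_sloppy(outofrange, 6));   /* sloppy: 2F 4000002F */
    printf("strict: %ld %ld\n", decode_strict(overlong, 2),
           decode_strict(outofrange, 6));                  /* strict: -1 -1 */
    return 0;
}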
Now, not everyone agrees that trying to fix *either* of these problems by
standards engineering was a sensible approach, but there is no doubt that
it *was* done and the current standards *do* call for it.
Do you think that an I/O layer should check for high-surrogate code points encoded into UTF-8 and perform some arbitrary action on those? What about U+FFFE and U+FFFF, or a stray BOM in the middle of some text?
I don't think it is the job of an I/O layer to make such decisions. Capping the code point range at U+10FFFF is no different from defining 0x110000-0x7FFFFFFF to be "not a character". I highly doubt that most existing implementations are truly as pedantic as the standard could be interpreted to call for, and doing code point filtering that early strikes me as splitting hairs at the wrong end.
Now, on the other hand, delivering a document marked as UTF-8 which contains any of the above-mentioned code points would be incorrect. A document is supposed to be a finished product, fully filtered and normalized.
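As a sketch of what that document-level filtering might look like (the function name is mine, nothing standard), the checks I would keep out of the I/O layer all fit naturally here:

#include <stdint.h>

/* Returns 1 if a code point is acceptable in a finished UTF-8 document,
 * 0 otherwise.  This is where I would reject surrogates, noncharacters
 * such as U+FFFE/U+FFFF, and a BOM that is not at the start of the text,
 * rather than in the I/O layer that merely decodes byte sequences. */
static int acceptable_in_document(uint32_t cp, int at_start_of_text)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)        /* UTF-16 surrogate code points */
        return 0;
    if ((cp & 0xFFFE) == 0xFFFE)             /* U+xxFFFE / U+xxFFFF noncharacters */
        return 0;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)        /* the other noncharacter block */
        return 0;
    if (cp == 0xFEFF && !at_start_of_text)   /* stray BOM in the middle of the text */
        return 0;
    return cp <= 0x10FFFF;                   /* and nothing beyond the Unicode range */
}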
