Richard Wordingham wrote:

I'm afraid I don't get the analogy.

You can't build a full Unicode system out of Unicode-compliant parts.

Others will have to address Richard's point about canonical-equivalent sequences.

However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
(in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
critical wording, "When converting from UTF-8 to Unicode values,
however, implementations do not need to check that the shortest
encoding is being used,...". There was no prohibition on
implementations performing the check, so whether C0 80 would be
interpreted as U+0000 or as an error was unpredictable.
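
To make that unpredictability concrete, here is a minimal sketch of a decoder in which the shortest-form check is optional. The function name decode_one and the check_shortest flag are hypothetical, introduced only for illustration, and the sketch covers only 1-, 2- and 3-byte sequences.

```python
def decode_one(seq: bytes, check_shortest: bool) -> int:
    """Decode a single UTF-8 sequence of 1 to 3 bytes to a code point.

    Hypothetical helper for illustration only; 4-byte forms are not handled.
    """
    lead = seq[0]
    if lead < 0x80:
        length, value, minimum = 1, lead, 0x00
    elif 0xC0 <= lead <= 0xDF:
        length, value, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead <= 0xEF:
        length, value, minimum = 3, lead & 0x0F, 0x800
    else:
        raise ValueError("unsupported lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for trail in seq[1:]:
        if (trail & 0xC0) != 0x80:
            raise ValueError("bad trail byte")
        value = (value << 6) | (trail & 0x3F)
    # The optional check: reject a value that would fit in a shorter form.
    if check_shortest and value < minimum:
        raise ValueError("overlong (non-shortest) encoding")
    return value


# Without the check, C0 80 is read as U+0000.
lenient_result = decode_one(b"\xc0\x80", check_shortest=False)

# With the check, the same bytes are an error.
try:
    decode_one(b"\xc0\x80", check_shortest=True)
    strict_result = "accepted"
except ValueError:
    strict_result = "rejected"
```

Both behaviours were permitted by the wording quoted above, which is why the same two bytes could give different results in different implementations.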

So it is as I said, and as TUS said before Corrigendum #1 was approved, more than 16 years ago: It was not legal to create overlong sequences, but implementations were allowed to interpret any that they came across.

As someone who pays attention to the fine details, you will certainly appreciate the difference between "it was once legal to encode NUL as E0 80 80" and "it was once legal for a decoder to interpret the sequence E0 80 80 as NUL instead of rejecting it."
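
The two sides of that distinction can be seen in a short sketch using Python 3's built-in UTF-8 codec, which, as far as I know, applies the stricter post-Corrigendum rule when decoding. The helper try_decode is a hypothetical name introduced only for this example.

```python
# Encoding side: a conformant encoder produces only the shortest form,
# so U+0000 is encoded as the single byte 00, never as E0 80 80.
encoded_nul = "\u0000".encode("utf-8")


def try_decode(data: bytes):
    """Return the decoded string, or None if the bytes are rejected."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Decoding side: the shortest form is accepted.
shortest = try_decode(b"\x00")

# The overlong form E0 80 80 is rejected by a strict decoder. A pre-Corrigendum
# decoder was allowed to interpret it as U+0000 instead; no encoder was ever
# allowed to produce it.
overlong = try_decode(b"\xe0\x80\x80")
```

The asymmetry is the whole point: what a decoder was once permitted to tolerate says nothing about what an encoder was permitted to emit.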

--
Doug Ewell | Thornton, CO, US | ewellic.org
