Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Doug Ewell via Unicode Wed, 17 May 2017 18:55:19 -0700

Richard Wordingham wrote:

I'm afraid I don't get the analogy.


You can't build a full Unicode system out of Unicode-compliant parts.

Others will have to address Richard's point about canonical-equivalent sequences.

However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
(in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
critical wording, "When converting from UTF-8 to Unicode values,
however, implementations do not need to check that the shortest
encoding is being used,...". There was no prohibition on
implementations performing the check, so whether C0 80 would be
interpreted as U+0000 or as an error was unpredictable.

So it is as I said, and as TUS said before Corrigendum #1 was approved, more than 16 years ago: It was not legal to create overlong sequences, but implementations were allowed to interpret any that they came across.

As someone who pays attention to the fine details, you will certainly appreciate the difference between "it was once legal to encode NUL as E0 80 80" and "it was once legal for a decoder to interpret the sequence E0 80 80 as NUL instead of rejecting it."
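
To make that distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: decode_lenient and strict_rejects are hypothetical names, and the lenient decoder is a toy that skips the shortest-form check in the way the old wording permitted, not any real implementation. It also ignores surrogates and the U+10FFFF upper bound. Python's built-in codec stands in for a decoder that does perform the check.

```python
from typing import List


def decode_lenient(data: bytes) -> List[int]:
    """Hypothetical decoder that does NOT check for shortest form."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Pick the payload bits and tail length from the lead byte alone.
        if lead < 0x80:
            cp, tail_len = lead, 0
        elif 0xC0 <= lead <= 0xDF:
            cp, tail_len = lead & 0x1F, 1
        elif 0xE0 <= lead <= 0xEF:
            cp, tail_len = lead & 0x0F, 2
        elif 0xF0 <= lead <= 0xF7:
            cp, tail_len = lead & 0x07, 3
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1 : i + 1 + tail_len]
        # Every trailing byte must have the form 10xxxxxx.
        if len(tail) != tail_len or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("bad continuation bytes at offset %d" % i)
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        out.append(cp)  # no overlong check: C0 80 yields 0 here
        i += 1 + tail_len
    return out


def strict_rejects(data: bytes) -> bool:
    """True if Python's built-in UTF-8 codec refuses to decode the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# A conformant encoder only ever produces the shortest form for NUL.
assert "\x00".encode("utf-8") == b"\x00"

# The lenient decoder interprets both overlong forms as U+0000.
assert decode_lenient(b"\xc0\x80") == [0]
assert decode_lenient(b"\xe0\x80\x80") == [0]

# A checking decoder treats the same bytes as ill-formed.
assert strict_rejects(b"\xc0\x80")
assert strict_rejects(b"\xe0\x80\x80")
```

The encoder never emits C0 80 or E0 80 80; only the decoder's behavior on them differs, which is the whole point.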

--

Doug Ewell | Thornton, CO, US | ewellic.org
