Richard Wordingham wrote:

>> It is not at all clear what the intent of the encoder was - or even
>> if it's not just a problem with the data stream. E0 80 80 is not
>> permitted, it's garbage. An encoder can't "intend" it.
>
> It was once a legal way of encoding NUL, just like C0 80, which is
> still in use, and seems to be the best way of storing NUL as character
> content in a *C string*.
I wish I had a penny for every time I'd seen this urban legend.

At http://doc.cat-v.org/bell_labs/utf-8_history you can read the original definition of UTF-8, from Ken Thompson on 1992-09-08, so long ago that it was still called FSS-UTF:

"When there are multiple ways to encode a value, for example UCS 0, only the shortest encoding is legal."

Unicode once permitted implementations to *decode* non-shortest forms, but never allowed an implementation to *create* them (http://www.unicode.org/versions/corrigendum1.html):

"For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead."

This was the passage that was deleted as part of Corrigendum #1.

--
Doug Ewell | Thornton, CO, US | ewellic.org
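As a side note (not from the original post): the shortest-form rule described above is exactly what strict modern decoders enforce. A minimal sketch in Python, whose built-in UTF-8 codec rejects the overlong two-byte sequence C0 80 while the legal one-byte encoding of U+0000 round-trips:

```python
# Overlong encoding of NUL: C0 80. A conformant decoder must reject it
# (in valid UTF-8, the lead bytes C0 and C1 can never appear at all).
try:
    b"\xc0\x80".decode("utf-8")
    print("decoded (non-conformant!)")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)

# The only legal UTF-8 encoding of U+0000 is the single byte 00.
assert "\x00".encode("utf-8") == b"\x00"
assert b"\x00".decode("utf-8") == "\x00"
```

(Java's "Modified UTF-8" for serialized strings is the best-known holdout that still writes C0 80 for NUL, precisely so that no embedded 00 byte appears inside a C-style string.)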