2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode <unicode@unicode.org>:
>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode <unicode@unicode.org>
> wrote:
> > ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.

Why just UTF-32? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to
valid UTF-8/UTF-16/UTF-32?

In all cases this would require extensions to all three standards (which
MUST be interoperable). You would then choke on new validation rules for
those extensions in each of the three standards, and on new ill-formed
sequences that you could not convert interoperably.

Given the most restrictive conditions in UTF-16 (which is still the most
widely used internal representation), such extensions would also be very
complex to manage.

There is no solution: such extensions to any one of the three are
undesirable and can only be used privately (without interoperating with the
other two representations), so it is impossible to guarantee that the
original octet sequences can be restored.

Any deviation from UTF-8/16/32 will be confined to that same UTF. It cannot
be part of the three standard UTFs, but it may be part of a distinct
encoding that is not fully compatible with the three standards.
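
To make the point about private extensions concrete, here is a minimal sketch using Python's "surrogateescape" error handler (PEP 383). It is not mentioned in this thread and is only one example of such a private convention, not a proposal. The sample bytes and the helper name is_interchangeable are made up for illustration. Ill-formed bytes do round-trip, but only inside the process that knows the convention; the intermediate string cannot be written out as conformant UTF-8, UTF-16 or UTF-32.

```python
# Sample input (hypothetical): valid UTF-8 for "café-" followed by two
# bytes, 0xFF and 0xFE, that can never appear in well-formed UTF-8.
raw = b"caf\xc3\xa9-\xff\xfe"

# Private convention (PEP 383): each undecodable byte 0xNN is mapped to
# the lone surrogate U+DCNN instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# The original octets can be restored, but only by an encoder that
# applies the same convention.
restored = text.encode("utf-8", errors="surrogateescape")


def is_interchangeable(s: str, encoding: str) -> bool:
    """Return True if s can be encoded with the strict, standard encoder."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print("escaped code points:", escaped)
    print("round trip ok:", restored == raw)
    for enc in ("utf-8", "utf-16", "utf-32"):
        # Lone surrogates are not Unicode scalar values, so every strict
        # standard encoder refuses the intermediate string.
        print(enc, "accepts it:", is_interchangeable(text, enc))
```

The intermediate string is in effect a distinct encoding that lives only inside one implementation, and all three standard encoders reject it.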