On 31 May 2017, at 20:24, Shawn Steele via Unicode <unicode@unicode.org> wrote:
> 
> > For implementations that emit FFFD while handling text conversion and 
> > repair (ie, converting ill-formed
> > UTF-8 to well-formed), it is best for interoperability if they get the same 
> > results, so that indices within the
> > resulting strings are consistent across implementations for all the correct 
> > characters thereafter.
>  
> That seems optimistic :) 
>  
> If interoperability is the goal, then it would seem to me that changing the 
> recommendation would be contrary to that goal.  There are systems that will 
> not or cannot change to a new recommendation.  If such systems are updated, 
> then adoption of those systems will likely take some time.

Indeed, if interoperability is the goal, the behaviour should be fully 
specified, not merely recommended.  At present, though, it appears that we have 
(broadly) two different behaviours in the wild, and nobody wants to change what 
they presently do.
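To make the two behaviours concrete, here is a minimal sketch (mine, not from the thread; the function names and the simplified well-formedness table are illustrative). Given the ill-formed sequence E0 80 80, the "maximal subpart" recommendation emits one U+FFFD per maximal subpart of the ill-formed subsequence, i.e. three replacements here, while the competing behaviour treats the whole lead-plus-trailing-bytes run as a single unit and emits one:

```python
REPLACEMENT = "\uFFFD"

def _seq_info(lead: int):
    """(length, lo, hi) for the first continuation byte of a well-formed
    sequence starting with `lead`, or None if `lead` cannot start one."""
    if 0xC2 <= lead <= 0xDF: return (2, 0x80, 0xBF)
    if lead == 0xE0:         return (3, 0xA0, 0xBF)
    if 0xE1 <= lead <= 0xEC: return (3, 0x80, 0xBF)
    if lead == 0xED:         return (3, 0x80, 0x9F)
    if 0xEE <= lead <= 0xEF: return (3, 0x80, 0xBF)
    if lead == 0xF0:         return (4, 0x90, 0xBF)
    if 0xF1 <= lead <= 0xF3: return (4, 0x80, 0xBF)
    if lead == 0xF4:         return (4, 0x80, 0x8F)
    return None

def decode_maximal_subpart(data: bytes) -> str:
    """One U+FFFD per maximal subpart: restart the decoder at the first
    byte that cannot continue the current sequence."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        info = _seq_info(b)
        if info is None:  # lone continuation byte or impossible lead
            out.append(REPLACEMENT); i += 1; continue
        length, lo, hi = info
        j = i + 1
        if j < len(data) and lo <= data[j] <= hi:
            j += 1
            while j < i + length and j < len(data) and 0x80 <= data[j] <= 0xBF:
                j += 1
        if j == i + length:                # well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                              # maximal subpart data[i:j]
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

def decode_one_per_sequence(data: bytes) -> str:
    """The other behaviour seen in the wild: consume the lead byte plus
    any trailing 80..BF bytes (up to the announced length) and emit a
    single U+FFFD for the whole run if it is ill-formed."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        if   0xC0 <= b <= 0xDF: length = 2
        elif 0xE0 <= b <= 0xEF: length = 3
        elif 0xF0 <= b <= 0xF7: length = 4
        else:                   length = 1
        j = i + 1
        while j < i + length and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

# The same input produces strings of different lengths under the two
# policies, which is exactly the index-consistency problem quoted above:
# decode_maximal_subpart(b"\xe0\x80\x80")   -> "\ufffd\ufffd\ufffd"
# decode_one_per_sequence(b"\xe0\x80\x80")  -> "\ufffd"
```

Any index computed after the replacement point differs between the two results, so code that exchanges string offsets across implementations can disagree even though both sides "handled" the error.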

Personally I agree with Shawn on this; the presence of a U+FFFD indicates that 
the input was invalid somehow.  You don’t know *how* it was invalid, and 
probably shouldn’t rely on equivalence with another invalid string.

There are obviously some exceptions - e.g. it *may* be desirable in the context 
of browsers to specify the behaviour in order to avoid behavioural differences 
being used for JavaScript-based “fingerprinting”.  But I don’t see why WHATWG 
(for instance) couldn’t do that.

Kind regards,

Alastair.

--
http://alastairs-place.net
