Re: Code points vs Unicode scalar values

Anne van Kesteren Fri, 20 Sep 2013 06:00:50 -0700

On Fri, Sep 20, 2013 at 6:28 AM, Erik Corry <[email protected]> wrote:
> Just to be clear, V8 does not generate CESU-8 if you give it well formed
> UTF-16.


Sure.


> If you give it broken UTF-16 with unpaired surrogates you can either break
> the data or emit CESU-8.  In the first case, you overwrite the unpaired
> surrogates with some sort of error character code.  In the second case you
> can generate three-byte UTF-8 sequences that are not strictly legal.  The
> second option will preserve the data if you round-trip it into V8 again (or
> feed it to other apps that are liberal in what they accept), so that's what
> V8 currently does.

That's a bug. A UTF-8 encoder should never emit byte sequences that
are not valid UTF-8. For a lone surrogate it should either emit U+FFFD
(as the bytes EF BF BD) or terminate processing. Lone surrogates should
not round-trip through the encoding layer, as that can create down-level
security bugs in unsuspecting decoders.
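
To make the replacement behavior concrete, here is a minimal sketch of
an encoder that takes a JavaScript string (UTF-16 code units) and emits
only well-formed UTF-8, substituting U+FFFD for each lone surrogate.
The function name encodeUtf8 is hypothetical and this is not V8's
implementation, just an illustration under those assumptions.

```typescript
// Hypothetical helper, not V8's code: encode UTF-16 code units as UTF-8,
// replacing every lone surrogate with U+FFFD (bytes EF BF BD).
function encodeUtf8(input: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < input.length; i++) {
    let cp = input.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff) {
      // Lead surrogate: only valid when a trail surrogate follows.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed pair: combine into one supplementary code point.
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++;
      } else {
        cp = 0xfffd; // lone lead surrogate
      }
    } else if (cp >= 0xdc00 && cp <= 0xdfff) {
      cp = 0xfffd; // trail surrogate with no preceding lead
    }
    // At this point cp is always a Unicode scalar value.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(
        0xe0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}
```

With this sketch, "a\uD800b" encodes to 61 EF BF BD 62, whereas the
lenient approach described above would put ED A0 80 in the middle,
which a strict UTF-8 decoder must reject.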


-- 
http://annevankesteren.nl/
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
