Re: Full Unicode based on UTF-16 proposal

Roger Andrews Mon, 26 Mar 2012 05:16:07 -0700

Maybe String.isValid is just not generally useful enough. I accept the point that you don't add APIs simply to flag an issue (there has to be more weighty justification to carry the trifle).

PS:

As for UTF-16 -> UTF-8 or HTML-Formdata, I decided to follow encodeURI / encodeURIComponent's lead and throw an exception. Maybe that's the wrong thing to do?
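
To illustrate the throwing behaviour described above, here is a minimal sketch of a UTF-16 -> UTF-8 step built on encodeURIComponent. The function name utf16ToUtf8Bytes is hypothetical and not taken from the transcoder under discussion; the only behaviour relied on is that encodeURIComponent raises URIError when it meets an unpaired surrogate.

```javascript
// Hypothetical helper: turn a JS (UTF-16) string into an array of UTF-8 byte values.
// It inherits encodeURIComponent's behaviour of throwing URIError on a lone surrogate.
function utf16ToUtf8Bytes(str) {
  const escaped = encodeURIComponent(str); // throws URIError for ill-formed UTF-16
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped[i] === "%") {
      // "%XX" is one UTF-8 byte written as two hex digits.
      bytes.push(parseInt(escaped.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Unescaped characters are always ASCII, so the code unit is the byte.
      bytes.push(escaped.charCodeAt(i));
    }
  }
  return bytes;
}
```

With this approach a caller has to wrap the conversion in try/catch if ill-formed input should be tolerated rather than rejected.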

My UTF-8 -> UTF-16 does check for well-formed UTF-8 because it seemed the right thing to do. Thanks for the link which explains why.
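
A well-formedness check of this kind might look roughly like the sketch below. It is not the code from the transcoder mentioned above; isWellFormedUtf8 is a hypothetical name, and the input is assumed to be an array (or Uint8Array) of byte values. It rejects non-shortest forms, encoded surrogates, values above U+10FFFF, and truncated or stray bytes.

```javascript
// Hypothetical validator: returns true only if `bytes` is well-formed UTF-8.
function isWellFormedUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    let need; // number of continuation bytes expected
    let min;  // smallest code point that legitimately needs this length
    let cp;   // code point being assembled
    if (b0 < 0x80) { i++; continue; }
    else if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; min = 0x80; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { need = 2; min = 0x800; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { need = 3; min = 0x10000; cp = b0 & 0x07; }
    else return false; // 0x80-0xC1 and 0xF5-0xFF can never start a sequence
    if (i + need >= bytes.length) return false; // truncated sequence
    for (let k = 1; k <= need; k++) {
      const b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) return false; // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min) return false;                      // non-shortest form
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // encoded surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += need + 1;
  }
  return true;
}
```

For example, the overlong sequence E0 80 AF (a three-byte spelling of "/") is rejected, which is the class of input the security concern is about.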

Base64 encodes 8-bit octets, so UTF-16 first gets converted to UTF-8, same issues as above really.


--------------------------------------------------
From: "Norbert Lindenberg"

Let's see:
- Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to convert it, so isValid doesn't really help. You still have to look at all code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.
- Conversion from UTF-8: For security reasons, you have to check for well-formedness before conversion, in particular to catch non-shortest forms [1].
- HTML form data: Same situation as conversion to UTF-8.

- Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.
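
The first point in the list above (convert anyway, substituting U+FFFD for unpaired surrogates) can be sketched as follows. This is only an illustration under stated assumptions: toUtf8Lenient is a hypothetical name, and the output is a plain array of byte values.

```javascript
// Hypothetical lenient encoder: UTF-16 string -> UTF-8 bytes,
// replacing each unpaired surrogate with U+FFFD (bytes EF BF BD).
function toUtf8Lenient(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++; // consume the low surrogate too
      } else {
        cp = 0xFFFD;
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD; // low surrogate with no preceding high surrogate
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

Because every code unit is inspected during encoding anyway, a separate validity pass beforehand adds nothing in this scenario.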
I don't think we'd add API just to flag an issue - that's what documentation is for.
Norbert

[1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit



On Mar 25, 2012, at 1:57 , Roger Andrews wrote:
I use something like String.isValid functionality in a transcoder that
converts Strings to/from UTF-8, HTML Formdata (MIME type
application/x-www-form-urlencoded -- not the same as URI encoding!), and
Base64.
Admittedly these currently use 'encodeURI' to do the work, or it just drops
out naturally when considering UTF-8 sequences.

(I considered testing the regexp
/^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
against the input string.)
Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.
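
The regexp test mentioned above can be wrapped as a function along these lines. This is a sketch, not the proposed String.isValid itself; the names WELL_FORMED_UTF16 and isValid are used here purely for illustration. The pattern has no "u" flag, so it matches individual UTF-16 code units.

```javascript
// Every code unit must be either a non-surrogate, or a high surrogate
// immediately followed by a low surrogate.
const WELL_FORMED_UTF16 =
  /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

// Illustrative stand-in for the String.isValid(str) idea discussed in this thread.
function isValid(str) {
  return WELL_FORMED_UTF16.test(str);
}
```

Under this definition a correctly paired "\uD83D\uDE00" is accepted, while a lone "\uD800" or a reversed pair "\uDE00\uD83D" is rejected.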
--------------------------------------------------
From: "Norbert Lindenberg"
It's easy to provide this function, but in which situations would it be
useful? In most cases that I can think of you're interested in far more
constrained definitions of validity:
- what are valid ECMAScript identifiers?
- what are valid BCP 47 language tags?
- what are the characters allowed in a certain protocol?
- what are the characters that my browser can render?

Thanks,
Norbert


On Mar 24, 2012, at 12:12 , David Herman wrote:
On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:
Concerning UTF-16 surrogate pairs, how about a function like:
   String.isValid( str )
to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().
No need for it to be a class method, since it only operates on strings.
We could simply have String.prototype.isValid(). Note that it would work
for primitive strings as well, thanks to JS's automatic promotion
semantics.

Dave

