I should have said "use appropriate error handling" instead of "convert 
unpaired surrogates to the UTF-8 sequence for U+FFFD". While using the 
replacement character is a reasonable default behavior, it's best to let the 
caller control the behavior. I'd assume that most callers would want as much 
information as possible to pass through, even if there's a stray unpaired 
surrogate in a string. If your converters just throw exceptions, then many 
callers will have to go through input strings themselves and remove unpaired 
surrogates that might have crept in.
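
As a rough illustration of what caller-controlled error handling could look 
like, here is a minimal sketch of a UTF-16 to UTF-8 converter. The function 
name utf16ToUtf8 and the onError values "replace", "throw", and "skip" are 
made up for this example; they are not part of any existing or proposed API.

```javascript
// Hypothetical helper: encode a JavaScript string (UTF-16 code units)
// as an array of UTF-8 byte values.
// onError (an assumed parameter) selects the handling of unpaired surrogates:
//   "replace" (default) emits the UTF-8 sequence for U+FFFD,
//   "throw"   raises an Error,
//   "skip"    drops the offending code unit.
function utf16ToUtf8(str, onError) {
  onError = onError || "replace";
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++;
      } else {
        cp = -1; // unpaired high surrogate
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = -1; // low surrogate without a preceding high surrogate
    }
    if (cp < 0) {
      if (onError === "throw") {
        throw new Error("Unpaired surrogate at index " + i);
      }
      if (onError === "skip") {
        continue;
      }
      cp = 0xFFFD;
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

With the default, a stray surrogate costs the caller three replacement bytes 
and everything else passes through; a caller that prefers the encodeURI-style 
behavior can still ask for "throw".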

For Base64, you could encode UTF-16 directly; you just have to make sure that 
encoder and decoder agree on the byte order.
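
A minimal sketch of that idea, assuming a hand-written Base64 encoder and an 
explicit byte-order flag (utf16ToBytes, bytesToBase64, and the littleEndian 
parameter are hypothetical names for this example). Because the code units 
are treated as plain binary data, unpaired surrogates pass through unchanged.

```javascript
var B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: split each UTF-16 code unit into two bytes.
// littleEndian is an assumed flag; encoder and decoder must use the same value.
function utf16ToBytes(str, littleEndian) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var hi = unit >> 8;
    var lo = unit & 0xFF;
    if (littleEndian) {
      bytes.push(lo, hi);
    } else {
      bytes.push(hi, lo);
    }
  }
  return bytes;
}

// Hypothetical helper: standard Base64 with "=" padding over byte values.
function bytesToBase64(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i += 3) {
    // Pack up to three bytes into 24 bits; missing bytes count as zero.
    var n = (bytes[i] << 16) | ((bytes[i + 1] || 0) << 8) | (bytes[i + 2] || 0);
    out += B64.charAt(n >> 18) + B64.charAt((n >> 12) & 63);
    out += i + 1 < bytes.length ? B64.charAt((n >> 6) & 63) : "=";
    out += i + 2 < bytes.length ? B64.charAt(n & 63) : "=";
  }
  return out;
}
```

The same string yields different Base64 text for the two byte orders, which 
is why both sides have to settle on one of them in advance.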

Norbert


On Mar 26, 2012, at 5:16 , Roger Andrews wrote:

> Maybe String.isValid is just not generally useful enough.  I accept the point 
> that you don't add APIs simply to flag an issue, (there has to be more 
> weighty justification to carry the trifle).
> 
> 
> PS:
> As for UTF-16 -> UTF-8 or HTML-Formdata, I decided to follow encodeURI / 
> encodeURIComponent's lead and throw an exception.  Maybe that's the wrong 
> thing to do?
> 
> My UTF-8 -> UTF-16 does check for well-formed UTF-8 because it seemed the 
> right thing to do.  Thanks for the link which explains why.
> 
> Base64 encodes 8-bit octets, so UTF-16 first gets converted to UTF-8, same 
> issues as above really.
> 
> --------------------------------------------------
> From: "Norbert Lindenberg"
>> 
>> Let's see:
>> 
>> - Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse 
>> to convert it, so isValid doesn't really help. You still have to look at all 
>> code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.
>> 
>> - Conversion from UTF-8: For security reasons, you have to check for 
>> well-formedness before conversion, in particular to catch non-shortest forms 
>> [1].
>> 
>> - HTML form data: Same situation as conversion to UTF-8.
>> 
>> - Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.
>> 
>> I don't think we'd add API just to flag an issue - that's what documentation 
>> is for.
>> 
>> Norbert
>> 
>> [1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit
>> 
>> 
>> 
>> On Mar 25, 2012, at 1:57 , Roger Andrews wrote:
>> 
>>> I use something like String.isValid functionality in a transcoder that
>>> converts Strings to/from UTF-8, HTML Formdata (MIME type
>>> application/x-www-form-urlencoded -- not the same as URI encoding!), and
>>> Base64.
>>> 
>>> Admittedly these currently use 'encodeURI' to do the work, or it just drops
>>> out naturally when considering UTF-8 sequences.
>>> 
>>> (I considered testing the regexp
>>> /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
>>> against the input string.)
>>> 
>>> Maybe the function is too obscure for general use, although its presence 
>>> does flag up the surrogate-pair issue to developers.
>>> 
>>> --------------------------------------------------
>>> From: "Norbert Lindenberg" <[email protected]>
>>>> 
>>>> It's easy to provide this function, but in which situations would it be
>>>> useful? In most cases that I can think of you're interested in far more
>>>> constrained definitions of validity:
>>>> - what are valid ECMAScript identifiers?
>>>> - what are valid BCP 47 language tags?
>>>> - what are the characters allowed in a certain protocol?
>>>> - what are the characters that my browser can render?
>>>> 
>>>> Thanks,
>>>> Norbert
>>>> 
>>>> 
>>>> On Mar 24, 2012, at 12:12 , David Herman wrote:
>>>> 
>>>>> On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:
>>>>> 
>>>>>> Concerning UTF-16 surrogate pairs, how about a function like:
>>>>>>   String.isValid( str )
>>>>>> to discover whether surrogates are used correctly in 'str'?
>>>>>> 
>>>>>> Something like Array.isArray().
>>>>> 
>>>>> No need for it to be a class method, since it only operates on strings.
>>>>> We could simply have String.prototype.isValid(). Note that it would work
>>>>> for primitive strings as well, thanks to JS's automatic promotion
>>>>> semantics.
>>>>> 
>>>>> Dave
>>>>> 

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
