Re: [nodejs] decodeURIComponent vs. binary Buffer.toString() + UTF-8 replacement chars

Taylor Hughes Fri, 20 Jul 2012 15:54:35 -0700

v0.6.18 — we haven't bumped our project up to 0.8 yet.

Just tried with 0.8.3 and you're right — looks good in 0.8. Didn't think an
upgrade would change this particular behavior so I didn't try it out. :)
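
For anyone landing on this thread later, here is a minimal sketch of the same check. It assumes a current Node release, where `new Buffer(...)` is deprecated in favour of `Buffer.from(...)` and 'latin1' is the documented name for the 'binary' encoding. The variable names are only illustrative.

```javascript
// The four UTF-8 bytes of U+1F354 (hamburger emoji), one byte per character.
const binary = '\u00f0\u009f\u008d\u0094';
// The same bytes, percent-encoded as in a form-encoded POST.
const urlencoded = '%F0%9F%8D%94';

// 'latin1' maps each character back to a single byte, like 'binary' did.
const fromBinary = Buffer.from(binary, 'latin1').toString('utf8');
const fromUri = decodeURIComponent(urlencoded);

console.log(fromBinary === fromUri);                 // true
console.log(fromBinary === '\ud83c\udf54');          // true
console.log(fromBinary.codePointAt(0).toString(16)); // 1f354
```

On a runtime with the old behavior, the first comparison would presumably print false, which makes it a cheap regression check.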


Amazing. Thanks!
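
On the StringDecoder() part of the original question: a rough sketch, again assuming a current Node release, of how the same four bytes behave when they arrive split across two chunks, as can happen inside a multipart parser. The chunk boundary here is made up for illustration.

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');

// Hypothetical chunk boundary: the emoji's four UTF-8 bytes arrive in two halves.
const firstChunk = Buffer.from([0xf0, 0x9f]);
const secondChunk = Buffer.from([0x8d, 0x94]);

// The decoder holds back the incomplete sequence instead of emitting \ufffd.
const partial = decoder.write(firstChunk);
// Once the remaining bytes arrive, the complete character comes out.
const complete = decoder.write(secondChunk);
// end() flushes anything still buffered; nothing should be left here.
const leftover = decoder.end();

console.log(JSON.stringify(partial), complete === '\ud83c\udf54', JSON.stringify(leftover));
```

Calling toString('utf8') on each chunk separately would not have this buffering, so a split sequence is where StringDecoder earns its keep.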

-t



On Fri, Jul 20, 2012 at 3:48 PM, Marcel Laverdet <[email protected]> wrote:

> What version of node? This is what I get:
>
> > var moji1 = (new Buffer('\xf0\x9f\x8d\x94', 'binary')).toString('utf-8');
> > var moji2 = (new Buffer('\u00f0\u009f\u008d\u0094', 'binary')).toString('utf-8');
> > var moji3 = decodeURIComponent('%F0%9F%8D%94');
> > moji1 == moji2
> true
> > moji2 == moji3
> true
>
>
> On Fri, Jul 20, 2012 at 1:48 PM, Taylor Hughes <[email protected]> wrote:
>
>> Hi nodejs group!
>>
>> I was just wrestling with a bug in our app (an iPhone emoji sent in a
>> multipart POST to a node.js backend, decoded with the formidable library)
>> and came across the following Interesting Case™.
>>
>> The bug was: emoji chars POSTed from an iPhone as part of a multipart
>> request were being converted into \ufffd (the Unicode replacement
>> character, U+FFFD), whereas with form-encoded POSTs they were not.
>>
>> From this behavior I isolated the following interesting snippet:
>>
>> // This is an emoji character POSTed by an iPhone:
>> var binary = '\u00f0\u009f\u008d\u0094';
>> // The same binary string, urlencoded byte for byte (what you get with a
>> // form-encoded POST of the same thing):
>> var urlencoded = '%F0%9F%8D%94';
>>
>> // Convert from the binary string
>> var utf8 = new Buffer(binary, 'binary').toString('utf-8');
>>
>> // Convert from the urlencoded version of the same thing
>> var utf8uri = decodeURIComponent(urlencoded);
>>
>> // Results are not the same:
>> utf8 == utf8uri // false
>>
>> // utf8    => "\ufffd" (Unicode replacement character)
>> // utf8uri => "\ud83c\udf54" (characters the iPhone can understand as the
>> // original emoji)
>>
>>
>> (Note that normal multibyte UTF-8 characters go through both paths the
>> same way and come out fine in both cases.)
>>
>> I'm mostly curious about why this happens — namely why
>> decodeURIComponent() is seemingly more permissive with UTF-8 decoding than
>> other mechanisms like StringDecoder() and Buffer.toString() — and if
>> there's a way to preserve strange UTF-8 characters using those mechanisms
>> too.
>>
>> Thanks!
>> Taylor
>>
>>  --
>> Job Board: http://jobs.nodejs.org/
>> Posting guidelines:
>> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
>> You received this message because you are subscribed to the Google
>> Groups "nodejs" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/nodejs?hl=en?hl=en
>>
>

