On Fri, Sep 5, 2014 at 10:08 AM, Mark Hahn <[email protected]> wrote:
>> So if I find \uFFFD as the last character of a valid but truncated utf8
>> buffer and I strip it, I should always end up with a valid string, right?
>>
>> That was an awkward sentence. Let me try in code. If buf is the first
>> 512 bytes of a long utf8 file will the following always produce a valid
>> string?
>
>
> str = buf.toString();
> if (str[str.length-1] === '\uFFFD') str = str.slice(0, -1);
Yes, unless there already was a replacement character in the input.
If you want to be sure, you can use
StringDecoder#detectIncompleteChar(). It's not documented but it
takes a buffer as its argument:
var buffer = /* ... */;
var StringDecoder = require('string_decoder').StringDecoder;
var dec = new StringDecoder('utf8');
dec.detectIncompleteChar(buffer);
if (dec.charReceived < dec.charLength) {
  // Partial character sequence.
}
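
If you would rather stay on the documented API, StringDecoder#write()
may be enough: it returns only complete characters and keeps the bytes
of an unfinished trailing sequence buffered inside the decoder. A
minimal sketch, assuming a Node version that has Buffer.from();
decodeTruncated is a hypothetical helper name, not part of Node:

```javascript
var StringDecoder = require('string_decoder').StringDecoder;

// Hypothetical helper: decode buf as UTF-8 and drop a trailing partial
// character, if there is one.
function decodeTruncated(buf) {
  var dec = new StringDecoder('utf8');
  // write() returns only the complete characters. The bytes of an
  // unfinished trailing sequence stay inside the decoder and are simply
  // discarded here, because end() is never called.
  return dec.write(buf);
}

// U+00E9 is 0xC3 0xA9 in UTF-8; cut the buffer after its first byte.
var full = Buffer.from('caf\u00e9', 'utf8');
var truncated = full.slice(0, full.length - 1);
console.log(decodeTruncated(truncated)); // prints "caf"
```

Unlike stripping a trailing \uFFFD, this leaves alone a replacement
character that was already present in the input.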
You can also implement the algorithm yourself if you don't want to
depend on an undocumented function. UTF-8 is a self-synchronizing
variable-length encoding; you can figure out the length of the character
by looking at the last one to three bytes. In a nutshell:
1. If (c & 0x80) == 0, then it's a single-byte character.
2. If (c & 0xC0) == 0xC0, then it's the start of a multi-byte character.
You can figure out its length by looking at the other bits.
3. If (c & 0xC0) == 0x80, then it's part of a multi-byte character.
Backtrack until you find a byte that satisfies criterion 2 (but don't
backtrack more than three bytes).
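
The three rules above could be sketched roughly like this.
incompleteTailLength is a hypothetical helper name; the sketch assumes
the buffer is otherwise valid UTF-8 and that Buffer.from() is available:

```javascript
// Hypothetical helper: number of trailing bytes in buf that belong to an
// incomplete UTF-8 character (0 if buf ends on a character boundary).
// Assumes the input is otherwise valid UTF-8.
function incompleteTailLength(buf) {
  // Never look at more than the last three bytes.
  var limit = Math.min(3, buf.length);
  for (var i = 1; i <= limit; i++) {
    var c = buf[buf.length - i];
    if ((c & 0x80) === 0) return 0;   // rule 1: single-byte character
    if ((c & 0xC0) === 0xC0) {        // rule 2: lead byte
      // The high bits of the lead byte give the sequence length.
      var need = (c & 0xE0) === 0xC0 ? 2
               : (c & 0xF0) === 0xE0 ? 3
               : 4;
      // i bytes of the sequence are present; incomplete if fewer than need.
      return i < need ? i : 0;
    }
    // rule 3: continuation byte (10xxxxxx), keep backtracking.
  }
  // No lead byte within the last three bytes: in valid input this is the
  // tail of a complete four-byte character.
  return 0;
}

// 'a' followed by the first two bytes of U+20AC (0xE2 0x82 0xAC).
var buf = Buffer.from([0x61, 0xE2, 0x82]);
var cut = incompleteTailLength(buf);
var str = buf.slice(0, buf.length - cut).toString('utf8');
console.log(str); // prints "a"
```

Trimming the bytes before calling toString() means no replacement
character is produced for the truncated tail in the first place.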