On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <[email protected]> wrote:

> Anyone have a good workaround? For instance, maybe a function that'll
> take in a byte array and convert *all* invalid UTF-8 sequences to a
> user-selected valid character?

I tend to do this a lot, for various reasons. In my experience, most of the string-handling functions in Phobos work just fine on strings containing invalid UTF-8 - you can generally use your intuition about whether a function needs to look at individual characters inside the string. Note, though, that there's currently a bug in D2/Phobos (issue 6064) which causes std.array.join (and possibly other functions) to treat strings not as something that can be joined by concatenation, but to do a character-by-character copy - which is both needlessly inefficient and will choke on invalid UTF-8.

When I really need to pass arbitrary data through string-handling functions, I use these functions:

import std.utf;

/// Convert arbitrary data to valid UTF-8 (each byte becomes the code point
/// with the same value), so D's string functions can properly work on it.
string rawToUTF8(string s)
{
        dstring d;
        foreach (char c; s)
                d ~= c;         // char widens to dchar; no decoding happens
        return toUTF8(d);
}

/// Inverse of rawToUTF8: narrow every code point back to a single byte.
string UTF8ToRaw(string r)
{
        string s;
        foreach (dchar c; r)
        {
                assert(c < '\u0100');
                s ~= cast(char) c; // cast needed: appending a dchar would re-encode it as UTF-8
        }
        return s;
}

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )
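A quick round-trip sanity check of the idea (restating the two functions so the snippet compiles on its own; note the cast when narrowing back to bytes, since plain `~=` of a dchar to a string would re-encode it as UTF-8):

```d
import std.utf;

string rawToUTF8(string s)
{
    dstring d;
    foreach (char c; s)
        d ~= c;             // treat each byte as a code point U+0000..U+00FF
    return toUTF8(d);
}

string UTF8ToRaw(string r)
{
    string s;
    foreach (dchar c; r)
    {
        assert(c < '\u0100');
        s ~= cast(char) c;  // narrow back to the original byte
    }
    return s;
}

void main()
{
    // \xFF is never valid in UTF-8, so this string is invalid as-is
    string raw = "abc\xFFdef";
    string safe = rawToUTF8(raw);
    assert(safe.length == 8);       // \xFF became the 2-byte encoding of U+00FF
    assert(UTF8ToRaw(safe) == raw); // lossless round trip
}
```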

Of course, it would be nice if it were possible to convert only the INVALID UTF-8 sequences. According to Wikipedia, the invalid Unicode code points U+DC80..U+DCFF are often used for encoding invalid byte sequences. I'd guess that a proper implementation would need to guarantee that a round trip always returns the same data as the input, so it would have to "escape" the invalid code points used for escaping as well.
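Here's a rough sketch of that escaping scheme (escapeInvalidUTF8 is a hypothetical name, not anything in Phobos). One wrinkle: std.utf refuses to encode surrogate code points, so the U+DC00+byte escapes have to be written out by hand, and the result will still be rejected by strict validators like std.utf.validate - it's only useful for interchange with systems that expect this convention:

```d
import std.utf : decode, UTFException;

/// Keep valid UTF-8 sequences intact; map each invalid byte B to the
/// code point U+DC00+B (so bytes 0x80..0xFF land in U+DC80..U+DCFF),
/// hand-encoded as three bytes since std.utf won't encode surrogates.
string escapeInvalidUTF8(string s)
{
    string result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t j = i;
        try
        {
            decode(s, j);        // advances j past one valid sequence
            result ~= s[i .. j]; // copy the valid bytes verbatim
        }
        catch (UTFException)
        {
            // escape the single offending byte as U+DC00 + byte
            uint cp = 0xDC00 + s[i];
            result ~= cast(char)(0xE0 | (cp >> 12));
            result ~= cast(char)(0x80 | ((cp >> 6) & 0x3F));
            result ~= cast(char)(0x80 | (cp & 0x3F));
            j = i + 1;
        }
        i = j;
    }
    return result;
}

void main()
{
    assert(escapeInvalidUTF8("abc") == "abc");       // valid input untouched
    assert(escapeInvalidUTF8("\xFF") == "\xED\xB3\xBF"); // U+DCFF, hand-encoded
}
```

Interestingly, this handles the "escape the escapes" problem for free: the byte sequence that would spell out U+DC80..U+DCFF in the input is itself invalid UTF-8 (decode rejects surrogates), so those bytes get escaped individually and the round trip stays lossless.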

--
Best regards,
 Vladimir                            mailto:[email protected]
