On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <[email protected]> wrote:

> Anyone have a good workaround? For instance, maybe a function that'll
> take in a byte array and convert *all* invalid UTF-8 sequences to a
> user-selected valid character?

I tend to do this a lot, for various reasons. In my experience, most of the string-handling functions in Phobos work just fine on strings containing invalid UTF-8 - you can generally use your intuition about whether a function needs to look at individual characters inside the string. Note, though, that there's currently a bug in D2/Phobos (issue 6064) which causes std.array.join (and possibly other functions) to treat strings not as something that can be joined by concatenation, but to do a character-by-character copy - which is both needlessly inefficient and will choke on invalid UTF-8.

When I really need to pass arbitrary data through string-handling functions, I use these functions:

import std.utf;

/// Convert arbitrary data to valid UTF-8 (each byte becomes the code point
/// with the same value), so D's string functions can properly work on it.
string rawToUTF8(string s)
{
        dstring d;
        foreach (char c; s)
                d ~= c;         // char widens to dchar; no decoding happens
        return toUTF8(d);
}

/// Inverse of rawToUTF8: narrow every code point back to a single byte.
string UTF8ToRaw(string r)
{
        string s;
        foreach (dchar c; r)
        {
                assert(c < '\u0100');
                s ~= cast(char) c; // cast needed: appending a dchar would re-encode it as UTF-8
        }
        return s;
}

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )
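A quick round-trip sanity check of the idea (restating the two functions so the snippet compiles on its own; note the cast when narrowing back to bytes, since plain `~=` of a dchar to a string would re-encode it as UTF-8):

```d
import std.utf;

string rawToUTF8(string s)
{
    dstring d;
    foreach (char c; s)
        d ~= c;             // treat each byte as a code point U+0000..U+00FF
    return toUTF8(d);
}

string UTF8ToRaw(string r)
{
    string s;
    foreach (dchar c; r)
    {
        assert(c < '\u0100');
        s ~= cast(char) c;  // narrow back to the original byte
    }
    return s;
}

void main()
{
    // \xFF is never valid in UTF-8, so this string is invalid as-is
    string raw = "abc\xFFdef";
    string safe = rawToUTF8(raw);
    assert(safe.length == 8);       // \xFF became the 2-byte encoding of U+00FF
    assert(UTF8ToRaw(safe) == raw); // lossless round trip
}
```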

Of course, it would be nice if it were possible to convert only the INVALID UTF-8 sequences. According to Wikipedia, the invalid Unicode code points U+DC80..U+DCFF are often used for encoding invalid byte sequences. I'd guess that a proper implementation would need to guarantee that a round trip always returns the same data as the input, so it would have to "escape" the invalid code points used for escaping as well.
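Here's a rough sketch of that escaping scheme (escapeInvalidUTF8 is a hypothetical name, not anything in Phobos). One wrinkle: std.utf refuses to encode surrogate code points, so the U+DC00+byte escapes have to be written out by hand, and the result will still be rejected by strict validators like std.utf.validate - it's only useful for interchange with systems that expect this convention:

```d
import std.utf : decode, UTFException;

/// Keep valid UTF-8 sequences intact; map each invalid byte B to the
/// code point U+DC00+B (so bytes 0x80..0xFF land in U+DC80..U+DCFF),
/// hand-encoded as three bytes since std.utf won't encode surrogates.
string escapeInvalidUTF8(string s)
{
    string result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t j = i;
        try
        {
            decode(s, j);        // advances j past one valid sequence
            result ~= s[i .. j]; // copy the valid bytes verbatim
        }
        catch (UTFException)
        {
            // escape the single offending byte as U+DC00 + byte
            uint cp = 0xDC00 + s[i];
            result ~= cast(char)(0xE0 | (cp >> 12));
            result ~= cast(char)(0x80 | ((cp >> 6) & 0x3F));
            result ~= cast(char)(0x80 | (cp & 0x3F));
            j = i + 1;
        }
        i = j;
    }
    return result;
}

void main()
{
    assert(escapeInvalidUTF8("abc") == "abc");       // valid input untouched
    assert(escapeInvalidUTF8("\xFF") == "\xED\xB3\xBF"); // U+DCFF, hand-encoded
}
```

Interestingly, this handles the "escape the escapes" problem for free: the byte sequence that would spell out U+DC80..U+DCFF in the input is itself invalid UTF-8 (decode rejects surrogates), so those bytes get escaped individually and the round trip stays lossless.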

--
Best regards,
 Vladimir                            mailto:[email protected]
