Hi, I wrote a small program to read and parse HTML (charset=UTF-8). It worked great until some invalid UTF-8 sequences appeared in a page. When a string is invalid, things like foreach or std.string.tolower simply crash. This makes the string type practically unusable for processing files, since there is no guarantee that a UTF-8 file contains no invalid UTF-8 sequences.
So I wrote a UTF-8 decoder myself to convert char[] to dchar[]. In my decoder, every invalid UTF-8 byte is converted to a low surrogate code point (0x80~0xFF -> 0xDC80~0xDCFF). Since low surrogates are not valid UTF-32 code points, I can still tell which parts of the string were invalid. Besides, after processing the dchar[] string, I can convert it back to a UTF-8 char[] without altering any of the invalid parts.

But it is still far too easy to crash a program with an invalid string. Is it possible to make this a native feature of string? Or is there another recommended way to solve this issue?

Thank you,
--ZY Zhou
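
To make the mapping concrete, here is a minimal sketch of the idea. It is in Python rather than D, only because Python's built-in "surrogateescape" error handler happens to use the same 0x80~0xFF -> 0xDC80~0xDCFF mapping, so the behaviour is easy to check. It is not the D decoder described above, and the helper names (decode_lossless, encode_lossless, invalid_positions) are made up for illustration.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte 0x80..0xFF to U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Encode to UTF-8, turning U+DC80..U+DCFF back into the original raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def invalid_positions(text: str) -> list:
    """Indices of code points that stand in for invalid input bytes."""
    return [i for i, ch in enumerate(text) if 0xDC80 <= ord(ch) <= 0xDCFF]


# Hypothetical input: valid UTF-8 ("é" = C3 A9) followed by a stray 0xFF byte.
raw = b"<P>caf\xc3\xa9 \xff</P>"

text = decode_lossless(raw)      # no exception, unlike a strict decode
print(invalid_positions(text))   # [8]

# Ordinary string processing works; the stand-in code point is left untouched.
lowered = text.lower()
print(encode_lossless(lowered))  # b'<p>caf\xc3\xa9 \xff</p>'
```

The invalid byte survives the round trip unchanged, and its position stays visible while the text is processed. One caveat of this scheme: the decoded string contains lone surrogates, so it must go back through the matching encoder before being written out or handed to code that expects strictly valid text.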