On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:
> Hi,
>
> I wrote a small program to read and parse HTML (charset=UTF-8). It worked
> great until some invalid UTF-8 chars appeared in that page.
> When the string is invalid, things like foreach or std.string.tolower will
> just crash.
> This makes the string type totally unusable when processing files, since
> there is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 chars.
>
> So I made a UTF-8 decoder myself to convert char[] to dchar[]. In my
> decoder, I convert all invalid UTF-8 bytes to low surrogate code
> points (0x80~0xFF -> 0xDC80~0xDCFF). Since low surrogates are invalid UTF-32
> codes, I'm still able to know which part of the string is invalid.
> Besides, after processing the dchar[] string, I can still convert it back
> to a UTF-8 char[] without affecting any of the invalid part.
>
> But it is still too easy to crash a program with an invalid string.
> Is it possible to make this a native feature of string? Or is there any
> other recommended method to solve this issue?

Check out std.utf. It has the functions for dealing with Unicode stuff.

- Jonathan M Davis
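
For illustration only, here is a rough sketch of how the byte-escaping decoder described in the quoted message might be built on top of std.utf rather than written from scratch. It assumes std.utf.validate and std.utf.decode throw UTFException on malformed input, as they do in current Phobos (older releases may spell the exception name differently). The helper names isValidUtf8 and decodeEscaping are hypothetical and are not part of Phobos.

```d
import std.utf : decode, validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
/// Relies on std.utf.validate throwing on malformed input.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Hypothetical helper: decodes UTF-8 to UTF-32, mapping every byte that
/// cannot be decoded to a low surrogate (0x80..0xFF -> 0xDC80..0xDCFF),
/// as described in the quoted message.
dchar[] decodeEscaping(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        try
        {
            // On success, decode advances i past one whole code point.
            result ~= decode(s, i);
        }
        catch (UTFException)
        {
            // Escape only the offending byte, then resume at the next one.
            result ~= cast(dchar)(0xDC00 + cast(ubyte) s[start]);
            i = start + 1;
        }
    }
    return result;
}

void main()
{
    import std.stdio : writefln;

    // 'a', a stray 0xFF byte, 'b': not valid UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    auto input = cast(const(char)[]) raw;

    writefln("valid: %s", isValidUtf8(input));
    foreach (dchar c; decodeEscaping(input))
        writefln("U+%04X", cast(uint) c);
}
```

With this mapping the stray 0xFF byte comes out as U+DCFF, so the invalid parts stay identifiable in the dchar[]. Converting back to char[] would need the reverse mapping written by hand, since the std.utf encoders should not be expected to accept surrogate code points.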