Rich Felker wrote:
> hope this isn't too off-topic -- i'm working on a utf-8 implementation
> and trying to decide what to do with byte sequences that are
> well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> a good reference on the precedents?

The three cases are probably best treated separately:

  - The range 0xd800-0xdfff. You should catch and reject these values as
    invalid when you are programming a conversion to UCS-2 or UTF-16,
    for example UTF-8 -> UTF-16 or UCS-4 -> UTF-16. Otherwise it becomes
    possible for malicious users to create non-BMP characters at a level
    of processing where earlier stages of processing did not see them.
    In a conversion from UTF-8 to UCS-4 you don't need to catch
    0xd800-0xdfff.

  - For the other two ranges, the advice is dictated merely by
    consistency. Most software layers treat 0xfffe-0xffff like
    unassigned Unicode characters, so there is no need to catch them.
    The range >= 0x110000 I would catch and reject as invalid. Some
    time ago I had a crash in an application because the first level of
    processing rejected only values >= 0x80000000, with a reasonable
    error message, while later processing relied on valid Unicode and
    called abort() when a character code >= 0x110000 was seen. Making
    the first level as strict as the later one fixed this.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
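
The checks described above could be sketched in C roughly as follows.
This is only an illustration under stated assumptions: the names
check_code_point, TARGET_UCS4 and TARGET_UTF16 are hypothetical and not
taken from any existing library, and the function assumes the UTF-8
sequence has already been decoded into a 32-bit value.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum target { TARGET_UCS4, TARGET_UTF16 };

/* Returns 0 if the decoded code point c may be passed on to the next
   stage, or EILSEQ if it should be rejected as an invalid sequence. */
static int check_code_point(uint32_t c, enum target t)
{
    /* Beyond the Unicode range: reject at the first level, so that
       later stages relying on valid Unicode never see such a value. */
    if (c >= 0x110000)
        return EILSEQ;

    /* Surrogates: reject when the output is UCS-2 or UTF-16, where a
       pair of them would be read back as a non-BMP character. */
    if (t == TARGET_UTF16 && c >= 0xd800 && c <= 0xdfff)
        return EILSEQ;

    /* 0xfffe-0xffff are treated like unassigned characters and are
       let through, as is everything else. */
    return 0;
}
```

With this sketch, check_code_point(0xd800, TARGET_UTF16) yields EILSEQ,
while the same value with TARGET_UCS4 is accepted. Whether a UCS-4
target should also reject surrogates is a policy choice; the sketch
follows the minimal requirement stated above.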