Markus Scherer wrote:

> Doug Ewell wrote:
>
> > It may be that broken UTF-16 text doesn't appear that often in the
> > real world.
>
> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like
> a surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes
> to itself, has default Unicode property values (except for the general
> category), etc.
>
> In other words, when you process 16-bit Unicode text it takes no effort to
> handle unpaired surrogates, other than making sure that you only assemble a
> supplementary code point when a lead surrogate is really followed by a trail
> surrogate. Hence little need for cleanup functions -- but if you need one,
> it's trivial to write one for UTF-16.
Thank you! This is what I've always understood about the design of the UTFs: they're generally quite robust. One errant character doesn't make the whole text unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's reasonably straightforward to handle anomalies.

So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16 file to a UTF-8 file and it died immediately on the first text file I tested it on. I got this error message:

    UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24, <$utf16_dat_fh> line 119.

So I checked the documentation (http://search.cpan.org/dist/Encode/Unicode/Unicode.pm#Error_Checking) and read this:

    Unlike most encodings which accept various ways to handle errors, Unicode encodings simply croaks.

    ...

    Unlike other encodings where mappings are not one-to-one against Unicode, UTFs are supposed to map 100% against one another. So Encode is more strict on UTFs.

    Consider that "division by zero" of Encode :)

I see nothing to grin about. Division by zero? Seriously? This effectively means I can't use Perl to transcode Unicode, at least not in the imperfect world *I* live in.

And GNU iconv is no better. It fails to transcode the same file with an even more laconic error message:

    iconv: Data.txt: cannot convert

I guess I should appeal to the maintainer of the Perl core Encode module to loosen the shackles a bit, eh?

Thank you all for your very helpful responses.

Jim Monty
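
For illustration, here is a minimal sketch of the kind of lenient UTF-16 to UTF-8 transcoding discussed above, written in Python rather than Perl. It assembles a supplementary code point only when a lead surrogate is really followed by a trail surrogate, and substitutes U+FFFD for anything unpaired. The function names are hypothetical, and the sketch assumes little-endian input with no BOM handling; substituting U+FFFD is one possible policy, not the only reasonable one.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (an assumed policy choice)


def utf16_units_to_code_points(units):
    """Yield code points from 16-bit code units.

    Unpaired surrogates are replaced with U+FFFD instead of raising.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: combine only if a trail surrogate really follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                i += 2
                continue
            yield REPLACEMENT  # lead surrogate with no trail after it
        elif 0xDC00 <= u <= 0xDFFF:
            yield REPLACEMENT  # trail surrogate with no lead before it
        else:
            yield u  # ordinary BMP code unit
        i += 1


def utf16le_to_utf8_lenient(data):
    """Transcode little-endian UTF-16 bytes to UTF-8 without dying on errors.

    Hypothetical helper: assumes the input has no BOM.
    """
    n_units = len(data) // 2
    units = struct.unpack("<%dH" % n_units, data[: n_units * 2])
    code_points = list(utf16_units_to_code_points(units))
    if len(data) % 2:
        # A dangling odd byte cannot form a code unit; mark it as well.
        code_points.append(REPLACEMENT)
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


if __name__ == "__main__":
    # 'A', an unpaired lead surrogate (db82, as in the error message), 'B'.
    broken = struct.pack("<3H", 0x0041, 0xDB82, 0x0042)
    print(utf16le_to_utf8_lenient(broken))
```

Python's built-in codecs should give a broadly similar result via `data.decode("utf-16-le", errors="replace").encode("utf-8")`, though the exact substitution behaviour in edge cases may differ from this sketch.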