Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Jim Monty Thu, 04 Nov 2010 15:07:03 -0700

Markus Scherer wrote:
> Doug Ewell wrote:
> > It may be that broken UTF-16 text doesn't appear that often in the
> > real world.
>
> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like
> a surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes
> to itself, has default Unicode property values (except for the general
> category), etc.
>
> In other words, when you process 16-bit Unicode text it takes no effort to 
> handle unpaired surrogates, other than making sure that you only assemble a 
> supplementary code point when a lead surrogate is really followed by a trail 
> surrogate. Hence little need for cleanup functions -- but if you need one, 
> it's trivial to write one for UTF-16.


Thank you! This is what I've always understood about the design of the UTFs:
they're generally quite robust. One errant character doesn't make the whole
text unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's
reasonably straightforward to handle anomalies.
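
To make that concrete, here is a minimal sketch in Python (not Perl) of what
lenient handling could look like. It only assembles a pair when a lead
surrogate is really followed by a trail surrogate, and it replaces anything
unpaired with U+FFFD while recording where it was. The function names, the
choice of U+FFFD as the repair, and the little-endian, no-BOM input are all
assumptions made for illustration, not anything a particular library does.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed repair policy)


def repair_utf16_units(units):
    """Return (repaired_units, problems) for a sequence of 16-bit code units.

    problems lists (index, unit) for every unpaired surrogate that was found.
    """
    repaired = []
    problems = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # lead surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                # Well-formed pair: keep both code units untouched.
                repaired.extend((u, units[i + 1]))
                i += 2
                continue
            problems.append((i, u))  # lead with no trail after it
            repaired.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:  # trail surrogate with no lead before it
            problems.append((i, u))
            repaired.append(REPLACEMENT)
        else:
            repaired.append(u)
        i += 1
    return repaired, problems


def utf16le_to_utf8(data):
    """Transcode UTF-16LE bytes to UTF-8, repairing unpaired surrogates.

    Hypothetical helper: no BOM handling, and a trailing odd byte is ignored.
    """
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data[:count * 2])
    repaired, problems = repair_utf16_units(units)
    text = struct.pack("<%dH" % len(repaired), *repaired).decode("utf-16-le")
    return text.encode("utf-8"), problems


# 'A', an unpaired lead surrogate DB82, then 'B'
sample = struct.pack("<3H", 0x0041, 0xDB82, 0x0042)
utf8_bytes, problems = utf16le_to_utf8(sample)
print(utf8_bytes)  # b'A\xef\xbf\xbdB'
print(problems)    # [(1, 56194)], i.e. code unit index 1, value 0xDB82
```

Whether to substitute U+FFFD, drop the code unit, or pass it through is a
policy decision; the point of the sketch is only that the text before and
after the anomaly survives.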

So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16
file to a UTF-8 file and it died immediately on the first text file I tested
it on. I got this error message:

    UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
    <$utf16_dat_fh> line 119.

So I checked the documentation 
(http://search.cpan.org/dist/Encode/Unicode/Unicode.pm#Error_Checking) and read 
this:

    Unlike most encodings which accept various ways to handle errors,
    Unicode encodings simply croaks.

    ...

    Unlike other encodings where mappings are not one-to-one against
    Unicode, UTFs are supposed to map 100% against one another. So
    Encode is more strict on UTFs.

    Consider that "division by zero" of Encode :)

I see nothing to grin about. Division by zero? Seriously? This effectively
means I can't use Perl to transcode Unicode, at least not in the imperfect
world *I* live in.

And GNU iconv is no better. It fails to transcode the same file with an even 
more laconic error message:

    iconv: Data.txt: cannot convert

I guess I should appeal to the maintainer of the Perl core Encode module to 
loosen the shackles a bit, eh?

Thank you all for your very helpful responses.

Jim Monty
