Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Markus Scherer Thu, 04 Nov 2010 16:46:17 -0700

On Thu, Nov 4, 2010 at 2:52 PM, Jim Monty <jim.mo...@yahoo.com> wrote:


> > In other words, when you process 16-bit Unicode text it takes no effort to
> > handle unpaired surrogates, other than making sure that you only assemble a
> > supplementary code point when a lead surrogate is really followed by a trail
> > surrogate. Hence little need for cleanup functions -- but if you need one,
> > it's trivial to write one for UTF-16.
>
> Thank you! This is what I've always understood about the design of the UTFs:
> they're generally quite robust. One errant character doesn't make the whole
> text unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's
> reasonably straightforward to handle anomalies.
>
> So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16
> file to a UTF-8 file and it died immediately on the first text file I tested
> it on. I got this error message:
>
>     UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
>     <$utf16_dat_fh> line 119.
>

There is a difference between processing "16-bit Unicode text" and
converting it to UTF-8, UTF-32, or even well-formed UTF-16.

While processing 16-bit Unicode text which is not assumed to be well-formed
UTF-16, you can treat (*de*code) an unpaired surrogate as a mostly-inert
surrogate code point. However, you cannot *unambiguously* *en*code a
surrogate code point in 16-bit text (because you could not distinguish a
sequence of lead+trail surrogate code points from one supplementary code
point), and therefore it is not allowed to encode surrogate code points in
any *well-formed UTF*-8/16/32. [All of this is discussed in The Unicode
Standard, Chapter 3.]
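
To make the decode/encode asymmetry concrete, here is a minimal sketch in
Python (my choice for illustration; the thread itself is about Perl). The
function names decode_units and encode_units are hypothetical, and the input
is assumed to be a plain list of 16-bit code unit values rather than bytes.

```python
def decode_units(units):
    """Turn 16-bit code units into code points.

    A lead surrogate is combined with the next unit only if that unit is
    really a trail surrogate. Any other surrogate is passed through
    unchanged as a (mostly inert) surrogate code point.
    """
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            code_points.append(
                0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            )
            i += 2
        else:
            code_points.append(u)
            i += 1
    return code_points


def encode_units(code_points):
    """Naive encoder that does NOT reject surrogate code points."""
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            units.append(cp)
    return units


# The unpaired lead surrogate 0xDB82 survives decoding as its own code point.
mixed = decode_units([0x41, 0xD83D, 0xDE00, 0xDB82, 0x42])

# Two surrogate code points and one supplementary code point produce the
# same code units, so the encoding cannot be undone unambiguously.
ambiguous = encode_units([0xD83D, 0xDE00]) == encode_units([0x1F600])
```

With the naive encoder, the code point sequence <D83D, DE00> and the single
code point U+1F600 come out as identical code units, which is the ambiguity
that well-formed UTFs avoid by disallowing surrogate code points.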

So a converter is correct in treating an unpaired surrogate as an error. On
the other hand...

> I guess I should appeal to the maintainer of the Perl core Encode module to
> loosen the shackles a bit, eh?
>

Any conversion library should offer options for *how to deal with* errors.
One way is to return an error, throw an exception, or equivalent. Another is
to replace the offending sequence with some substitution character (usually
U+FFFD when the target is a form of Unicode) and continue converting after
that.
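
As an illustration of those two options, here is a small sketch using
Python's built-in codecs, which accept an errors argument ("strict" raises,
"replace" substitutes U+FFFD). This is not the Perl Encode API; the helper
name utf16le_to_utf8 and the sample bytes are made up for the example, and
the input is assumed to be UTF-16LE without a BOM.

```python
def utf16le_to_utf8(data: bytes, errors: str = "strict") -> bytes:
    """Convert UTF-16LE bytes to UTF-8, delegating error policy to the codec."""
    text = data.decode("utf-16-le", errors=errors)
    return text.encode("utf-8")


# "A", then the unpaired lead surrogate 0xDB82, then "B".
sample = "A".encode("utf-16-le") + b"\x82\xdb" + "B".encode("utf-16-le")

# Option 1: report the error (here, an exception).
try:
    utf16le_to_utf8(sample)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 2: substitute U+FFFD for the offending unit and keep converting.
replaced = utf16le_to_utf8(sample, errors="replace")
```

In the second case the text on either side of the bad unit is converted
normally, so one errant surrogate does not make the whole file unusable.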

If the conversion libraries you are using do not support this (I don't
know), then you could ask for such options. Or use conversion libraries that
do support such options (like ICU and Java).

Best regards,
markus
