RE: What does it mean to "not be a valid string in Unicode"?

Whistler, Ken Mon, 07 Jan 2013 17:58:06 -0800

Martin,

The kind of situation Markus is talking about is illustrated particularly well 
in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to 
this issue:

http://www.unicode.org/reports/tr10/#Handline_Illformed

When weighting Unicode 16-bit strings for collation, you can, of course, always 
detect an unpaired surrogate and return an error code or throw an exception, 
but that may not be the best strategy for an implementation.

The problem derives in part from the fact that for sorting, the comparison 
routine is generally buried deep down as a primitive comparison function in 
what may be a rather complicated sorting algorithm. Those algorithms often 
assume that the comparison routine is analogous to strcmp(), and will always 
return -1/0/1 (or negative/0/positive), and that it is not going to fail 
because it decides that some byte value in an input string is not valid in some 
particular character encoding. (Of course, the calling code needs to ensure it 
isn't handing off null pointers or unallocated objects, but that is par for the 
course for any string handling.)
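
To make that failure mode concrete, here is a minimal Python sketch. It is 
not taken from UTS #10 or from any real library: has_unpaired_surrogate and 
strict_compare are hypothetical names, strings are modeled as lists of 16-bit 
code units, and plain code unit order stands in for real collation. It 
suggests what happens when the comparison callback rejects an input instead 
of ordering it:

```python
from functools import cmp_to_key

def has_unpaired_surrogate(units):
    """Return True if the list of 16-bit code units is not well-formed UTF-16."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A lead surrogate must be followed by a trail surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            return True  # trail surrogate with no preceding lead
        i += 1
    return False

def strict_compare(a, b):
    """Hypothetical validating comparator: raises instead of ordering."""
    if has_unpaired_surrogate(a) or has_unpaired_surrogate(b):
        raise ValueError("ill-formed UTF-16")
    return (a > b) - (a < b)  # stand-in for real collation, not UCA

# The middle record contains a lone lead surrogate.
records = [[0x0062], [0xD800, 0x0061], [0x0061]]

try:
    sorted(records, key=cmp_to_key(strict_compare))
    sort_completed = True
except ValueError:
    # One bad record aborts the whole sort, not just one comparison.
    sort_completed = False
```

The sorting routine has no way to recover here: every caller of the sort now 
has to be prepared for an error that originates deep inside the comparison.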

Now if I want to adapt a particular sorting algorithm so it uses a 
UCA-compliant, multi-level collation algorithm for the actual string 
comparison, then by far the easiest way to do so is to build a function 
essentially comparable to strcmp() in structure, e.g. UCA_strcmp(context, 
string1, string2), which also always returns -1/0/1 for any two Unicode 16-bit 
strings. If I introduce a string validation aspect to this comparison routine, 
and return an error code or raise an exception, then I run the risk of 
marginally slowing down the most time-critical part of the sorting loop, as 
well as complicating the adaptation of the sorting code, to deal with extra 
error conditions. It is faster, more reliable and robust, and easier to adapt 
the code, if I simply specify for the weighting exactly what happens to any 
isolated surrogate in input strings, and compare accordingly. Hence the two 
alternative strategies suggested in Section 7.1.1 of UTS #10: either weight 
each maximal ill-formed subsequence as if it were U+FFFD (with a primary 
weight), or weight each 
surrogate code point with a generated implicit weight, as if it were an 
unassigned code point. Either strategy works. And in fact, the conformance 
tests in CollationTest.zip for UCA include some ill-formed strings in the test 
data, so that implementations can test their handling of them, if they choose.
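
As a rough illustration of the two strategies, here is a Python sketch of a 
comparison that never fails. Everything in it is an assumption made for the 
example: code_points and total_compare are hypothetical names, strings are 
lists of 16-bit code units, each unpaired surrogate is treated as one 
ill-formed subsequence, and raw code point order stands in for the weights a 
real UCA implementation would take from its collation element table.

```python
from functools import cmp_to_key

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

def code_points(units, replace_ill_formed=True):
    """Decode 16-bit code units to code points without ever failing.

    replace_ill_formed=True:  each unpaired surrogate is treated as U+FFFD.
    replace_ill_formed=False: each unpaired surrogate is kept as its own
                              code point, like an unassigned code point.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        # Well-formed pair: combine lead and trail into one code point.
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: apply the chosen strategy.
            out.append(REPLACEMENT if replace_ill_formed else u)
        else:
            out.append(u)
        i += 1
    return out

def total_compare(a, b, replace_ill_formed=True):
    """Always returns -1, 0 or 1, whatever code units the inputs contain."""
    ka = code_points(a, replace_ill_formed)
    kb = code_points(b, replace_ill_formed)
    return (ka > kb) - (ka < kb)  # stand-in for comparing UCA sort keys

# The middle record contains a lone lead surrogate; the sort still completes.
records = [[0x0062], [0xD800, 0x0061], [0x0061]]
ordered = sorted(records, key=cmp_to_key(total_compare))
```

Under the first strategy a lone surrogate compares equal to U+FFFD in this 
sketch, while under the second it keeps a distinct position. Either way the 
result is deterministic, and the sorting code needs no extra error handling.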

So in this kind of a case, what we are actually dealing with is: garbage in, 
principled, correct results out. ;-)

--Ken

> -----Original Message-----

> On 2013/01/08 3:27, Markus Scherer wrote:
> 
> > Also, we commonly read code points from 16-bit Unicode strings, and
> > unpaired surrogates are returned as themselves and treated as such (e.g.,
> > in collation). That would not be well-formed UTF-16, but it's generally
> > harmless in text processing.
> 
> Things like this are called "garbage in, garbage-out" (GIGO). It may be
> harmless, or it may hurt you later.
> 
> Regards,   Martin.
