In practice and by design, treating isolated surrogates the same as reserved code points during processing, and then cleaning them up on conversion to the UTFs, works just fine. It is a tradeoff that is up to the implementation.
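To make that concrete, here is a minimal sketch in Python, not a description of how any particular implementation does it. It assumes a string type that can hold surrogate code points internally, as Python's `str` does. The helper name `to_utf8_clean` and the choice of U+FFFD as the replacement are illustrative assumptions.

```python
# A string holding an isolated (unpaired) surrogate between two letters.
s = "a\ud800b"

# Internal processing treats the surrogate like any other code point:
# it is counted, and case mapping passes it through unchanged.
assert len(s) == 3
upper = s.upper()

# Strict conversion to UTF-8 refuses the surrogate...
try:
    upper.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False


def to_utf8_clean(text: str) -> bytes:
    """Hypothetical helper: clean up at the conversion boundary.

    In a Python str, every code point in U+D800..U+DFFF is an isolated
    surrogate (pairs are never stored as two surrogates), so each one
    is replaced with U+FFFD before encoding.
    """
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    return cleaned.encode("utf-8")


# ...while the cleaning conversion always yields well-formed UTF-8.
utf8_bytes = to_utf8_clean(upper)
```

Whether to replace, drop, or reject at the boundary is exactly the kind of tradeoff that is left to the implementation.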
It has nothing to do with a "legacy of C pointer arithmetic". It does represent a pragmatic choice made some time ago, but there is no need to get worked up about it. Human scripts and their representation on computers are complex enough; in the grand scheme of things, the handling of surrogates in implementations pales in significance.

Mark <https://plus.google.com/114199149796022210033>

*Il meglio è l’inimico del bene*

On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller <stephan.stil...@gmail.com> wrote:

>>> Things like this are called "garbage in, garbage-out" (GIGO). It may be
>>> harmless, or it may hurt you later.
>>
>> So in this kind of a case, what we are actually dealing with is: garbage
>> in, principled, correct results out. ;-)
>
> Wouldn't the clean way be to ensure valid strings (only) when they're
> built and then make sure that string algorithms (only) preserve
> well-formedness of input?
>
> Perhaps this is how the system grew, but it seems to be that it's
> yet another legacy of C pointer arithmetic and about convenience of
> implementation rather than a safety or performance issue.
>
> Stephan