[EMAIL PROTECTED] wrote:
> Any suggestions on what the right way to deal with "surrogate" codepoints
> in this algorithm? They should not occur in the data, but what if they do?

Either encode them as 3-byte UTF-8 sequences, or reject them, for example by throwing an exception. Note that the ISO 10646 definition of UTF-8 forbids encoding them at all, and it looks like the Unicode definition of UTF-8 is heading in the same direction.

markus
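
For illustration, here is a minimal sketch of the two options in Python. The function name encode_utf8 and the allow_surrogates flag are hypothetical, made up for this example; this is not the algorithm from the original thread, just one way the choice could look.

```python
def encode_utf8(cp: int, allow_surrogates: bool = False) -> bytes:
    """Encode a single code point as UTF-8 bytes.

    Hypothetical helper: surrogates (U+D800..U+DFFF) are rejected by
    default; with allow_surrogates=True they are encoded like any other
    code point in the 3-byte range.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF and not allow_surrogates:
        # Option 2: refuse to encode surrogates at all.
        raise ValueError(f"surrogate code point: U+{cp:04X}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Option 1 lands here for surrogates: a plain 3-byte sequence.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Rejecting by default follows the stricter ISO 10646 reading mentioned above; a caller who needs the lenient behaviour has to ask for it explicitly.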

