They are really "super private use" characters, available for definition within a given implementation or domain.
For example, in CLDR collation tables:

The code point U+FFFF is tailored to have a weight higher than that of all other characters. This allows reliable specification of a range, such as "Sch" ≤ X ≤ "Sch\uFFFF", to include all strings starting with "Sch" or equivalent.

The code point U+FFFE is tailored to have a weight lower than that of all other characters. This allows for Interleaved_Levels <http://unicode.org/reports/tr10/#Interleaved_Levels> within code point space. So you can sort the following and have it work nicely:

    sortKey = LastNameField + '\uFFFE' + FirstNameField

If someone happens to include an FFFE in one of these fields to be collated, you'll get an odd ordering, but not a disaster. If you really care about that, you can ensure that FFFE never appears in those database fields, just as, for example, you might prevent U+0001 from appearing there. But you really can't block it at a low level; otherwise I couldn't serialize the sortKey above into UTF-8, a perfectly legitimate thing to do.

Mark

*— Il meglio è l'inimico del bene — (The best is the enemy of the good)*

On Mon, Sep 19, 2011 at 15:26, Tom Christiansen <tchr...@perl.com> wrote:

> Mark Davis ☕ <m...@macchiato.com> wrote
> on Mon, 19 Sep 2011 14:41:49 PDT:
>
> > I agree with the first part, disallowing the irregular code sequences.
>
> Finding that Java allowed surrogates to sneak through in their UTF-8
> streams like that was quite odd.
>
> > As to the noncharacters, it would be a horrible mistake to disallow
> > them. Tom, a Java code converter is far too low a level for C9; if the
> > converter can't handle them, it screws up all perfectly legitimate
> > *internal* interchange. C9 is really for a very high level, e.g. don't
> > put them into interchanged plain text, like a web page. I agree that
> > it needs more clarification.
>
> Mark, thanks for taking the time to unravel that.
> It wasn't clear from the specs where, or perhaps even whether, you
> should or should not disallow the 66 noncharacter code points. A bit
> more clarity there would help.
>
> You bring up an interesting point. If you read a web page and want to
> use some of the noncharacter code points as sentinels per their
> suggested use during your internal processing, you have to be able to
> know that they weren't there to start with. Yes, you can check, one at
> a time, till you (hopefully!) find enough that aren't there that you
> can use them. But if that were what you had to do, then you could do
> that with any set of code points, not just noncharacter ones. So that
> doesn't seem to make sense.
>
> People using UTF-8 or UTF-32 implementations can always steal
> non-Unicode code points from above 0x1FFFFF for their own internal use
> *provided* they never try to pass those along, but that won't work for
> UTF-16 even internally.
>
> Is there anything that they can dependably use? It appears there is not.
>
> It's an interesting problem, and I see that it isn't as easily solved
> as I had hoped it might be. If you can't guarantee that even the 66
> noncharacter code points won't be in your data stream, I'm thinking
> this isn't going to be solvable at this level. It does make me wonder
> what those 66 noncharacter code points really are for, then, so it's
> back to rereading the specs again for me.
>
> thanks very much,
>
> --tom
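[Editor's note] The two sentinel tricks Mark describes can be sketched in plain Python. This is a toy simulation, not real CLDR collation: ordinary code-point comparison already puts U+FFFF above every other BMP character, but U+FFFE must be explicitly remapped to the lowest weight to mimic its CLDR tailoring. The data and the `coll_key` helper are illustrative inventions, standing in for the `LastNameField`/`FirstNameField` of the email.

```python
def coll_key(s: str) -> tuple:
    """Toy collation key: code-point order, except U+FFFE weighs lowest,
    simulating the CLDR tailoring described in the email."""
    return tuple(-1 if c == '\uFFFE' else ord(c) for c in s)

# 1. Range trick: "Sch" <= X <= "Sch\uFFFF" captures every BMP string
#    that starts with "Sch" (Python compares strings by code point).
words = ["Schiller", "Schmidt", "Sch", "Scott", "Bach"]
in_range = [w for w in words if "Sch" <= w <= "Sch\uFFFF"]
# in_range == ["Schiller", "Schmidt", "Sch"]

# 2. Interleaving trick: joining last and first name with U+FFFE makes a
#    single string sort as if the fields were compared one at a time,
#    because the separator weighs below every character in the fields.
people = [("Smithson", "Ann"), ("Smith", "John"), ("Smith", "Ann")]
ordered = sorted(people, key=lambda p: coll_key(p[0] + '\uFFFE' + p[1]))
# ordered == [("Smith", "Ann"), ("Smith", "John"), ("Smithson", "Ann")]
```

Without the remapping, raw code-point order would place U+FFFE above the letters, so "Smith\uFFFEJohn" would sort after "Smithson\uFFFEAnn", breaking the field-by-field interleaving that the CLDR tailoring guarantees.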