Mark Davis ☕ <m...@macchiato.com> wrote on Mon, 19 Sep 2011 14:41:49 PDT:
> I agree with the first part, disallowing the irregular code sequences.
> Finding that Java allowed surrogates to sneak through in their UTF-8
> streams like that was quite odd.
>
> As to the noncharacters, it would be a horrible mistake to disallow them.
>
> Tom, a Java code converter is far too low a level for C9; if the
> converter can't handle them, it screws up all perfectly legitimate
> *internal* interchange. C9 is really for a very high level, eg don't
> put them into interchanged plain text, like a web page. I agree that
> it needs more clarification.

Mark, thanks for taking the time to unravel that. It wasn't clear from the
specs where, or perhaps even whether, you should disallow the 66
noncharacter code points. A bit more clarity there would help.

You bring up an interesting point. If you read a web page and want to use
some of the noncharacter code points as sentinels during your internal
processing, per their suggested use, you have to be able to know that they
weren't there to start with. Yes, you can check, one at a time, till you
(hopefully!) find enough that aren't there that you can use them. But if
that were what you had to do, then you could do that with any set of code
points, not just noncharacter ones. So that doesn't seem to make sense.

People using UTF-8 or UTF-32 implementations can always steal non-Unicode
code points from above 0x1FFFFF for their own internal use *provided* they
never try to pass those along, but that won't work for UTF-16, even
internally.

Is there anything that they can dependably use? It appears there is not.

It's an interesting problem, and I see that it isn't as easily solved as I
had hoped it might be. If you can't guarantee that even the 66 noncharacter
code points won't be in your data stream, I'm thinking this isn't going to
be solvable at this level.

It does make me wonder what those 66 noncharacter code points really are
for, then, so it's back to rereading the specs again for me.

thanks very much,

--tom
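
For reference, the "check, one at a time" scan described in the message could be sketched roughly as below. This is only an illustration and not code from the thread. The names NONCHARACTERS, is_noncharacter and unused_noncharacters are hypothetical. The sketch assumes the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, and that the input has already been decoded to a Python str.

```python
# The 66 noncharacters: 32 in U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF
# for each of the 17 planes (n = 0x0 .. 0x10).
NONCHARACTERS = list(range(0xFDD0, 0xFDF0)) + [
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
]


def is_noncharacter(cp):
    """Return True if cp is one of the 66 noncharacter code points."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The low 16 bits are FFFE or FFFF, and cp lies within the Unicode range.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def unused_noncharacters(text):
    """Return the noncharacter code points that do not occur in text.

    These are the only ones that would be safe to use as internal
    sentinels for this particular input.
    """
    present = {ord(ch) for ch in text if is_noncharacter(ord(ch))}
    return [cp for cp in NONCHARACTERS if cp not in present]


if __name__ == "__main__":
    sample = "plain text with one stray \ufdd0 noncharacter"
    free = unused_noncharacters(sample)
    print(len(free), "noncharacters are free to use as sentinels")
```

The scan only answers the question for one input at a time, which is the weakness the message points out: the same approach would work for any set of code points, so it gives noncharacters no special guarantee.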