One minor point, and then I'll respond to the main one:

> .NET uses UTF-16 (UCS-2), since there was no UTF-8 when it was designed.
Check your facts on this: UTF-8 dates back to 1992, which predates the first major release of Java, never mind .NET.

But the larger problem here is that ontogeny recapitulates phylogeny. The fact that Java is so intimately tied to XML processing means that large bodies of existing code are written to a UCS-2 indexing assumption. This is further amplified by the fact that nearly all useful character sets are encodable in UCS-2, and UCS-4 font sizes aren't very tractable. The long and short of all this is that we're stuck with legacy indexing schemes. Perhaps more importantly, we're stuck with a three-layer stack in which characters, code points, and code units can defy interpretation unless consistency rules are imposed at all three layers.

So, two addenda to my earlier four rules:

1. I had not intended to imply that strings would *not* include UCS-1 (byte) indexing. If only for the sake of certain low-level memory operations, they need to be byte indexable.

2. The model I propose is very careful not to take any position that commits the implementation to a particular representation. I'd note that the IBM ICU components have a very strong string implementation that satisfies all of the concerns you raise while retaining perfectly fine in-memory space performance.

shap
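As an aside, the three-layer distinction is easy to demonstrate in Java itself, where the String API indexes by UTF-16 code unit (the UCS-2 legacy assumption). A small sketch, using an arbitrary supplementary-plane character as the example:

```java
// Sketch of the three-layer stack (code units vs. code points) in Java,
// whose String API indexes by UTF-16 code unit -- the UCS-2 legacy assumption.
public class ThreeLayerDemo {
    public static void main(String[] args) {
        // U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A) needs a surrogate
        // pair in UTF-16, so code-unit and code-point counts diverge.
        String s = "A\uD835\uDD38";

        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        // Indexing by code unit can land inside a character:
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

A consistent string model has to say which of these layers an index refers to; leaving it implicit is exactly how code written to the UCS-2 assumption breaks.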
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
