Re the UTF-8 design, you are correct. I mixed up two things: that when .NET was designed no one used UCS-1 and UCS-2 was common, and the fact that it wasn't out yet when Windows was designed.
Looking further, UCS-2 is now regarded as obsolete as a document representation, and UTF-16 is not the same thing, since it has variable-sized extensions (surrogate pairs). (Note: all UCS-2 is readable as UTF-16, but not the reverse.) Yet basic indexing into a variable-sized format such as UTF-8 or UTF-16 is misleading to developers, because you nearly always need to do an O(n) scan from the start; this means you need different methods to handle it optimally. If you want to allow indexable strings I would suggest two string types, but you can do all the indexing you need with char[] (or utf32[], etc.).

While Java does have excellent XML parsers, there are plenty of good C ones which do UTF-8. The libxml2 SAX parser blows away the Java ones by 30-50%, and working in UCS-2 means you may not be able to meet your C performance goal. In Java land StAX APIs are common: even though they give inferior performance, they have an easier-to-use API. Anyway, for BitC I don't see this strong Java base as an issue, as there are plenty of good C (UTF-8) parsers (and a few C++ StAX ones) and you can easily write a wrapper with minimal impact. If UCS-2 were common and strong I would weigh this argument more heavily, but it's a legacy standard, and Java, Windows and hence .NET are burdened with conversion costs to and from UCS-2. There are no UCS-2 documents anymore, and a UCS-2 system which can't do UTF-8, UTF-16 or UTF-32 representations is even illegal in China.

> 2. The model I propose is very careful not to take any position that commits the implementation to a particular representation. I'ld note that the IBM ICU components have a very strong string implementation that satisfies all of the concerns you raise while retaining perfectly fine in-memory space performance

Java still suffers from excessive memory usage on embedded devices, and its SAX XML parsers are still inferior to the C ones.
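To make the O(n) indexing point above concrete, here is a rough sketch in Python (chosen only for brevity; it is not BitC code). The helper name utf8_index is made up for illustration. It finds the n-th code point of a UTF-8 byte string by walking from the start and counting lead bytes, which is the scan a naive indexer on a variable-width string would hide from the developer.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 `data` by scanning from the start.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    if n < 0:
        raise IndexError("negative index")
    seen = -1     # index of the most recent code point whose lead byte we passed
    start = None  # byte offset at which code point n begins
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == n:
                start = offset
            elif seen == n + 1:
                # The next code point begins here, so code point n ends here.
                return data[start:offset].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")  # n was the last code point


data = "aé€😀z".encode("utf-8")  # 5 code points stored in 11 bytes
third = utf8_index(data, 3)      # has to walk past the first 6 bytes to find it
```

A plain byte index such as data[1] lands in the middle of a multi-byte sequence here, which is exactly the trap. The same holds for UTF-16 once surrogate pairs appear; only a fixed-width array (char[] or utf32[] in the terms above) gives a true O(1) index.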
Regarding taking a position: that is true, but note that, as I said, an indexer on a string implies a fixed-width implementation to a developer. Only ASCII, UCS-2, UCS-4 and UTF-32 can offer one without causing developers to write unexpectedly poorly performing code, as happens with UTF-8 and UTF-16. If you exclude a public indexer from string (and just provide conversion to and from char[]), then the std lib can handle indexing as needed, and a dev who implements indexing will have to be more careful about the format. E.g. if char is UTF-8, string can have an underlying representation of char[], while all the lib methods are on string. This provides a number of additional benefits, e.g.:

- A string can be copied to char[] at very low cost (a cast is possible, but you lose the immutability).
- Programmers will use indexing on strings only when needed, relying more on the library. This subtly improves code quality; this is very obvious in the MS world (and .NET uses UCS-2, so could have used indexers).
- Strings are immutable, providing GC benefits as well as multithreading benefits, especially avoiding the diabolical "string changed by another thread" issue.

Ben
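P.S. A rough sketch of the "no public indexer" idea, written in Python purely for illustration (it is not BitC, and the class name Text and its methods are invented for this example). The string is immutable and keeps its UTF-8 bytes private, the library methods live on the string, and anyone who wants indexing has to ask for a mutable copy explicitly. One assumption to flag: the copy here is a list of code points (closer to the utf32[] option), so it costs a decode rather than being the cheap copy described above for a UTF-8 char.

```python
class Text:
    """Hypothetical immutable string with no public indexer."""

    __slots__ = ("_utf8",)

    def __init__(self, chars=()):
        # "from char[]": accept any sequence of characters and store it as UTF-8.
        object.__setattr__(self, "_utf8", "".join(chars).encode("utf-8"))

    def __setattr__(self, name, value):
        # Block rebinding of fields so instances can be shared between threads.
        raise AttributeError("Text is immutable")

    def to_chars(self):
        # "to char[]": a fresh, mutable, fixed-width copy that is safe to index.
        return list(self._utf8.decode("utf-8"))

    def contains(self, other):
        # Byte-level search is valid because UTF-8 is self-synchronizing.
        return other._utf8 in self._utf8

    def starts_with(self, other):
        return self._utf8.startswith(other._utf8)


t = Text("héllo")
chars = t.to_chars()  # indexing happens on the copy, never on the string
chars[1] = "e"        # mutating the copy leaves t untouched
```

Since Text defines no __getitem__, writing t[0] fails with a TypeError, so any indexing has to go through to_chars() or a library method, which is the behaviour argued for above.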
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
