Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Tue, 12 Oct 2010 19:09:44 -0700

Re the UTF-8 design, you are correct. I mixed up two things: that when .NET
was designed no one used UCS-1 and UCS-2 was common, and the fact that it
wasn't out when Windows was designed.


Looking further, UCS-2 is now regarded as obsolete as a document
representation, and UTF-16 is not the same thing because it has
variable-sized extensions (surrogate pairs). (Note that all UCS-2 is readable
as UTF-16, but not the reverse.) Yet basic indexing of a variable-sized
format such as UTF-8 or UTF-16 is misleading to developers, as you nearly
always need to do an O(n) scan from the start; this means you need different
methods to handle it optimally. If you want to allow indexable strings I
would suggest two string types, but you can do all the indexing you need
with char[] (or utf32[] etc.).
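
To illustrate the indexing point, here is a minimal Python sketch. The helper
name utf8_code_point_at is made up for this example and is not from any
library; it assumes the input is valid UTF-8. Finding the n-th code point
means walking from the start, because each code point takes one to four
bytes. The last line shows a character outside the BMP needing two 16-bit
units in UTF-16, which is what UCS-2 cannot represent.

```python
def utf8_code_point_at(data: bytes, index: int) -> str:
    """Return the code point at position `index` in valid UTF-8 `data`.

    Hypothetical helper for illustration: it has to scan from the start,
    so each lookup is O(n) in the length of the buffer.
    """
    if index < 0:
        raise IndexError(index)
    pos = 0
    remaining = index
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        else:
            width = 4
        if remaining == 0:
            return data[pos:pos + width].decode("utf-8")
        remaining -= 1
        pos += width
    raise IndexError(index)


# 1-, 2-, 3- and 4-byte code points, followed by one ASCII character.
text = "a\u00e9\u20ac\U0001d11e!"
encoded = text.encode("utf-8")

assert len(encoded) == 11                       # 1 + 2 + 3 + 4 + 1 bytes
assert utf8_code_point_at(encoded, 3) == "\U0001d11e"
assert utf8_code_point_at(encoded, 4) == "!"

# U+1D11E lies outside the BMP: UTF-16 needs a surrogate pair (two 16-bit
# units), which a fixed-width UCS-2 reader cannot interpret.
assert len("\U0001d11e".encode("utf-16-le")) == 4
```

Calling such a helper inside a loop over every index turns a linear pass into
a quadratic one, which is the trap a plain indexer on a UTF-8 or UTF-16
string invites.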

While Java does have excellent XML parsers, there are plenty of good C ones
that handle UTF-8. Libxml2's SAX parser beats the Java ones by 30-50%, and
working in UCS-2 means you may not be able to meet your C performance goal.
In Java land StAX APIs are common: they give inferior performance, but the
API is easier to use. Anyway, for bitc I don't see this strong Java base as
an issue, as there are plenty of good C (UTF-8) parsers (and a few C++ StAX
ones) and you can easily write a wrapper with minimal impact. If UCS-2 were
common and strong I would weigh this argument more heavily, but it's a
legacy standard, and Java, Windows and hence .NET are burdened with the cost
of converting to and from UCS-2. There are no UCS-2 documents anymore, and a
UCS-2 system which can't do UTF-8, UTF-16 or UTF-32 representations is even
illegal in China.

2. The model I propose is very careful not to take any position that commits
the implementation to a particular representation. I'd note that the IBM
ICU components have a very strong string implementation that satisfies all
of the concerns you raise while retaining perfectly fine in-memory space
performance.

Java still suffers from excessive memory usage on embedded devices, and its
SAX XML parsers are still inferior to the C ones. Regarding not taking a
position: that is true, but as I said, an indexer on a string implies to a
developer a fixed-width implementation, which can only be ASCII, UCS-2,
UCS-4 or UTF-32; with anything else developers end up writing unexpectedly
poorly performing code for UTF-8 and UTF-16. If you exclude a public indexer
from string (and just provide conversion to and from char[]), then the std
lib can handle indexing as needed, and a dev who implements indexing will
have to be more careful about the format.

E.g. if char is UTF-8, string can have an underlying representation of
char[], with all the lib methods on string. This provides a number of
additional benefits, e.g.:

-          String can be copied to char[] at very low cost (a cast is
possible, but you lose the immutability).

-          Programmers will use indexing on strings only when needed,
relying more on the library. This subtly improves code quality; it is very
obvious in the MS world (and .NET uses UCS-2, so could have used indexers).

-          Strings are immutable, providing GC benefits as well as
multi-threading benefits, especially avoiding the diabolical "string changed
by another thread" issue.
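
A rough sketch of that design in Python. The class Utf8String and all of its
method names are invented for illustration; this is not bitc syntax and not
an existing library. The string wraps an immutable UTF-8 buffer, deliberately
has no indexer, keeps the library methods on the string type, and hands out
an indexable array of code points only when asked.

```python
class Utf8String:
    """Hypothetical immutable string stored as UTF-8, with no public indexer."""

    __slots__ = ("_data",)

    def __init__(self, text: str):
        # bytes is immutable, so instances can be shared between threads
        # without another thread changing the contents underneath us.
        object.__setattr__(self, "_data", text.encode("utf-8"))

    def __setattr__(self, name, value):
        raise AttributeError("Utf8String is immutable")

    # Library-style operations live on the string and work on the raw bytes.
    def find(self, needle: "Utf8String") -> int:
        """Byte offset of the first occurrence of `needle`, or -1."""
        return self._data.find(needle._data)

    def starts_with(self, prefix: "Utf8String") -> bool:
        return self._data.startswith(prefix._data)

    def to_code_points(self) -> list:
        """One O(n) decode into a fixed-width, indexable array (the utf32[] role)."""
        return [ord(ch) for ch in self._data.decode("utf-8")]

    def to_bytes(self) -> bytes:
        """The underlying buffer; no copy is needed since bytes is immutable."""
        return self._data


s = Utf8String("na\u00efve caf\u00e9")
points = s.to_code_points()          # pay the O(n) cost once, then index freely
assert points[2] == 0xEF             # U+00EF, the third code point
assert s.find(Utf8String("caf")) == 7  # a byte offset, not a character index
```

Because the class defines no `__getitem__`, writing `s[2]` fails with a
TypeError, so a developer who wants indexing has to choose a representation
explicitly.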

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev