Corrections

1)

I stated that .NET uses UCS-2, but it uses UTF-16 (I never realized that all
those indexes would take O(n) to find the position of a character).

Also, Windows has been converting its internals from UCS-2 to UTF-16 since
Windows 2000.

Perl uses UTF-8.

Java originally used UCS-2, and added UTF-16 supplementary character support
in J2SE 5.0. 
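
To make the UCS-2 vs UTF-16 difference concrete, here is a small sketch in
Python (my choice for illustration; the helper name utf16_units is made up).
A character inside the Basic Multilingual Plane takes one 16-bit unit, while a
supplementary character needs a surrogate pair, which is why a 16-bit index no
longer equals a character index.

```python
# Illustrative only: count the 16-bit code units UTF-16 needs for a string.
# utf16_units is a hypothetical helper, not part of any runtime discussed here.

def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u4e2d"         # CJK ideograph inside the BMP: fits in UCS-2
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF: outside the BMP

if __name__ == "__main__":
    print(utf16_units(bmp_char), utf16_units(astral_char))
```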

All these schemes use O(n) indexing by character. I see no one who does what
I proposed: O(1) byte-offset indexes (internal to the array), with character
indexes available only through the ToArray method (except inside the lib).
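
A rough sketch of what I mean, in Python for brevity. The class and method
names (Utf8String, find, slice, to_char_array) are invented for illustration
and are not a BitC or .NET API. Positions handed out by the string are byte
offsets, so locating them again needs no scan; character indexing only exists
after an explicit conversion to an array.

```python
class Utf8String:
    """Immutable string stored as UTF-8 bytes; all indexes are byte offsets.

    Hypothetical sketch: names and layout are assumptions, not a real library.
    """

    def __init__(self, text: str):
        self._data = text.encode("utf-8")

    def byte_length(self) -> int:
        return len(self._data)

    def find(self, needle: str, start: int = 0) -> int:
        """Return the byte offset of the first match at or after start, or -1.

        UTF-8 is self-synchronizing, so a match of a valid UTF-8 needle
        always begins on a character boundary.
        """
        return self._data.find(needle.encode("utf-8"), start)

    def slice(self, start: int, end: int) -> str:
        """Substring between two byte offsets previously returned by find().

        No scan is needed to locate start and end; the cost is only the copy.
        """
        return self._data[start:end].decode("utf-8")

    def to_char_array(self) -> list:
        """Explicit O(n) conversion for callers that need character indexes."""
        return list(self._data.decode("utf-8"))
```

Usage would look something like: take the offset from find(), pass it straight
back to slice(), and only call to_char_array() when the code really needs the
n-th character.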


2) I was also under the impression that BitC offered C-style mutable
strings. So when I suggested removing the index from string and converting to
an array, that was what I meant.



Anyway, the only viable options available are basically:

UCS-2, which offers O(1) indexing and finds but can't represent characters
outside the Basic Multilingual Plane (including some Asian characters) without
a non-standard encoding on top of the internal string representation, and
takes 2 bytes of storage per character.

UTF-8 with O(n) indexing, which allows the developer to refer to the
character. Note that on x86 you can use a fast SSE2 scan for the 10xxxxxx
continuation-byte bit pattern to count characters more quickly.

UTF-8 with O(1) byte indexing, with more focus on runtime methods and
ToFixedCharArray methods for char indexing.
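
For the second option, the character count and the O(n) character index both
come down to skipping UTF-8 continuation bytes (those of the form 10xxxxxx).
Below is a plain scalar sketch in Python of that idea; an SSE2 version would
apply the same mask test to 16 bytes at a time. The function names are mine,
and the input is assumed to be valid UTF-8.

```python
def count_chars(data: bytes) -> int:
    """Count code points by counting bytes that are NOT continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """O(n) walk: byte offset at which the char_index-th code point starts."""
    seen = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # lead byte or ASCII byte starts a char
            seen += 1
            if seen == char_index:
                return offset
    raise IndexError("char index out of range")
```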

If we go with the figure that 70% of a DB is strings (and I would say objects
are the same, at least for business objects, as they map to the same thing),
then 1 Gig of objects or DB in UTF-8 would be 1.7 Gig in UCS-2 and 3.1 Gig in
UCS-4.
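
The arithmetic behind those figures, as a quick sketch. It assumes the string
data is mostly ASCII, so each character is 1 byte in UTF-8, and that the
non-string 30% does not change size. The function name is just for
illustration.

```python
def total_size_gig(utf8_total_gig: float, string_fraction: float,
                   bytes_per_char: int) -> float:
    """Total size after re-encoding the string part at a fixed character width.

    Assumes the strings cost 1 byte per character in UTF-8 (mostly ASCII).
    """
    non_string = utf8_total_gig * (1 - string_fraction)          # unchanged
    strings = utf8_total_gig * string_fraction * bytes_per_char  # widened
    return non_string + strings


if __name__ == "__main__":
    print(round(total_size_gig(1.0, 0.7, 2), 2))  # UCS-2: 0.3 + 0.7 * 2
    print(round(total_size_gig(1.0, 0.7, 4), 2))  # UCS-4: 0.3 + 0.7 * 4
```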


Ben 

_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
