Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Wed, 13 Oct 2010 18:01:22 -0700

At least I was right about .NET after all these years: like Java, it uses a
borked system.


 

>BitC does not provide mutable strings.

 

That's good

 

>The get next/previous operation speed is much more important than finding
>the initial location.

 

True, but there is a lot of code of the form: index of start, index of end,
take substring. This could be horrible, say for </body> on a typical 2-3K
HTML page, and even from </body> would be bad.
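
To make the cost concrete, here is a minimal sketch in Python (not BitC), assuming the string is stored as UTF-8 bytes and the caller hands over code point indexes. The names byte_offset_of_char and substring_by_char_index are made up for illustration; the point is only that each index passed in from outside forces a scan from the start.

```python
def byte_offset_of_char(buf: bytes, char_index: int) -> int:
    """Scan UTF-8 data from the start to find where code point char_index begins."""
    seen = 0
    for offset, b in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if (b & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(buf)  # one past the end is a valid slice bound
    raise IndexError(char_index)


def substring_by_char_index(buf: bytes, start: int, end: int) -> bytes:
    """Two caller-supplied char indexes cost two scans from the start."""
    return buf[byte_offset_of_char(buf, start):byte_offset_of_char(buf, end)]


# A roughly 3K page with non-ASCII content (hypothetical test data).
page = ("<html><body>" + "caf\u00e9 " * 500 + "</body></html>").encode("utf-8")

# Stand-in for "index of start, index of end" returning char indexes.
text = page.decode("utf-8")
start = text.index("<body>") + len("<body>")
end = text.index("</body>")

body = substring_by_char_index(page, start, end)
```

Here the second scan walks almost the whole page again just to turn the char index of </body> back into a byte offset.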

 

I know it's not really O(n); I just used it for want of a better term to
describe developers passing in an index that has to be scanned for from the
start in a non-fixed-width char representation. In most operations you can
work off the result of the find within the library, but when the user
communicates an index to the lib you have to do one of the following:

 

-          Additional storage for an index (which is bad and complicated).

-          Use a complex index type, say a union with char and byte offset
(in theory the compiler should make this just as efficient, and it
communicates well between programmer and lib). You could overcome legacy
issues here by making the index the traditional char/code point rather
than the byte offset.

-          Cache a single char offset to byte offset lookup for large
strings.

-          Scan from the start.

-          Use byte offsets and handle legacy code via a ToFixedCharArray().
I kind of like this, since a lot of C legacy code relies on mutable strings.
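
As a rough illustration of the complex index type option, here is a sketch in Python rather than BitC. StrIndex, find and substring are hypothetical names, and I model the index as a pair carrying both offsets instead of a true union. The find result already knows its byte offset, so handing it back to the library costs no rescan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StrIndex:
    char: int  # code point index, what legacy code expects to see
    byte: int  # byte offset into the UTF-8 data, what the library uses


def find(buf: bytes, needle: bytes, start: StrIndex = StrIndex(0, 0)) -> Optional[StrIndex]:
    """Byte-wise search from start; safe because UTF-8 is self-synchronizing."""
    pos = buf.find(needle, start.byte)
    if pos < 0:
        return None
    # Code points skipped = bytes between start and match that are not
    # 10xxxxxx continuation bytes.
    skipped = sum(1 for b in buf[start.byte:pos] if (b & 0xC0) != 0x80)
    return StrIndex(start.char + skipped, pos)


def substring(buf: bytes, lo: StrIndex, hi: StrIndex) -> bytes:
    """O(1) to locate both ends: only the byte offsets are consulted."""
    return buf[lo.byte:hi.byte]


page = "<body>na\u00efve caf\u00e9</body>".encode("utf-8")
open_tag = find(page, b"<body>")
close_tag = find(page, b"</body>", open_tag)
# "<body>" is 6 ASCII characters, so char and byte both advance by 6.
body_start = StrIndex(open_tag.char + 6, open_tag.byte + 6)
body = substring(page, body_start, close_tag)
```

Legacy code can still read the char field, while the library never has to convert it back.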

 

I assume the O(log(n)) refers to the fact that in many cases the search
is not from the start.

 

Lastly, is it a good idea to support multiple underlying schemes, aside from
legacy support methods like ToFixedCharArray()? Java and .NET have survived
without it, and having a single scheme helps interop. E.g. a byte code file
(.NET assembly or Windows DLL) will work on any machine, but with different
possible internal storage schemes this would not be possible. If you're
saying we leave it up to the lib, that doesn't really change things, as it
just moves the discussion to that point. Previously the string document made
a strong statement that BitC would use UCS-2 or UCS-4 and hence fixed-width
chars; with UCS-1 added, that part of the document doesn't say much anymore,
e.g. the runtime will use Unicode and may use byte or char indexing.

 

 

Hope I'm not railroading the development by questioning this :) I just think
the benefits of UTF-8 are strong when not dealing with legacy support.

 

 

 

Ben

 

 

UCS-2, which offers O(1) indexing and finds but can't represent characters
outside the Basic Multilingual Plane (including some Asian chars), requiring
non-standard encoding on top of the internal string representation, and
takes 2 bytes of storage per character.
UTF-8 with O(n) indexing, which allows the developer to refer to the
character. Note that on x86 you can use a fast SSE2 scan for the 10xxxxxx
continuation-byte bit pattern to count characters more quickly.
UTF-8 with O(1) byte indexing, with more focus on runtime methods and
ToFixedCharArray methods for char indexing.
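
A scalar sketch of the character count and of a ToFixedCharArray-style conversion, in Python for illustration only. An SSE2 version would apply the same continuation-byte test to 16 bytes at a time; that is not shown here. count_code_points and to_fixed_char_array are made-up names, not an existing API.

```python
def count_code_points(buf: bytes) -> int:
    """Count UTF-8 code points by skipping 10xxxxxx continuation bytes."""
    return sum(1 for b in buf if (b & 0xC0) != 0x80)


def to_fixed_char_array(buf: bytes) -> list:
    """Decode once into fixed-width code points so char indexing is O(1)."""
    return [ord(ch) for ch in buf.decode("utf-8")]


data = "\u65e5\u672c\u8a9e caf\u00e9".encode("utf-8")
n_chars = count_code_points(data)      # code points, not bytes
fixed = to_fixed_char_array(data)      # one int per code point
last = fixed[n_chars - 1]              # O(1) char indexing after conversion
```

The conversion pays the O(n) cost once up front, which is the trade the third option makes for legacy char-indexed code.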


Good list, but incomplete. Ropes with O(log(n)) indexing work just fine in
practice.
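
For readers unfamiliar with ropes, a minimal sketch of the indexing idea in Python. This is an illustration under simplifying assumptions, not a proposal for the BitC runtime: Leaf, Concat, build and char_at are hypothetical names, and each leaf here holds a Python str, whereas a UTF-8 rope would store bytes plus a char count per leaf.

```python
class Leaf:
    def __init__(self, text: str):
        self.text = text
        self.length = len(text)


class Concat:
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.weight = left.length                 # chars in the left subtree
        self.length = left.length + right.length


def build(chunks):
    """Build a balanced rope from a non-empty list of chunks."""
    if len(chunks) == 1:
        return Leaf(chunks[0])
    mid = len(chunks) // 2
    return Concat(build(chunks[:mid]), build(chunks[mid:]))


def char_at(node, i: int) -> str:
    """Walk down the tree; cost is the tree depth, O(log n) when balanced."""
    if not 0 <= i < node.length:
        raise IndexError(i)
    while isinstance(node, Concat):
        if i < node.weight:
            node = node.left
        else:
            i -= node.weight
            node = node.right
    return node.text[i]


rope = build(["Uni", "code ", "and ", "BitC"])
rebuilt = "".join(char_at(rope, i) for i in range(rope.length))
```

Only the final step inside a leaf depends on the leaf's own encoding, so the leaf size bounds any linear scan.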

Also worth noting that indexing is almost never random. The get
next/previous operation speed is much more important than finding the
initial location.

shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
