Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Wed, 13 Oct 2010 22:29:50 -0700

 >On 14 October 2010 11:59, Ben Kloosterman <[email protected]> wrote:
 >> -          Additional storage for an index (which is bad and
 >> complicated)
 >
 >Not necessarily, it can even be implemented once and then hidden from
 >the user.
 >
 >One of the things I had planned to do for the GNU HURD was to
 >implement file objects on which seeking to a line number was possible
 >due to such an internal index. Once implementation is decoupled from
 >interface, you can experiment with optimisations for the case you are
 >interested in. Sparse indexes on large strings could turn that O(n)
 >into O(log(n)) with a very small constant factor very quickly, and you
 >can treat small strings differently.
 >
 >The question I would ask if that was an option is how you deal with
 >the possibility that some strings have a different representation
 >internally. Since BitC is supposed to be the antithesis of the 'smart
 >runtime' that chooses optimal algorithms and data structures as it
 >sees fit, and since the indirection necessary to support arbitrary
 >implementations could be deemed too costly (whether that indirection
 >occurs via a typeclass dictionary or some compile-time specialisation
 >occurs), it might be better to push these sorts of options out to the
 >user. If I really want an indexed UTF-8 string, I say that, and if I
 >want a rope, I also have to tell you that.

I do favour pushing this to the user when it's necessary, but shouldn't the
default be reasonably fast and memory efficient? Especially when most of
those strings are just storage and not being worked on.
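
For reference, the hidden sparse index suggested in the quoted text could look
roughly like the sketch below. It is only an illustration under stated
assumptions: the class name IndexedUtf8, the stride parameter and the
fixed-stride checkpoint table are all made up here, and nothing like this
exists in BitC. It stores one byte offset per `stride` code points, so a lookup
is a table read plus a scan bounded by the stride.

```python
class IndexedUtf8:
    """Immutable UTF-8 bytes plus a sparse code point -> byte offset table.

    Hypothetical sketch: assumes the input is valid UTF-8 and treats code
    points (not grapheme clusters) as the unit of indexing.
    """

    def __init__(self, text, stride=64):
        self.data = text.encode("utf-8")
        self.stride = stride
        # checkpoints[k] is the byte offset of code point number k * stride.
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(self.data):
            if (byte & 0xC0) != 0x80:  # not a continuation byte
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count  # number of code points

    def byte_offset(self, char_index):
        """Return the byte offset at which code point char_index starts."""
        if not 0 <= char_index < self.length:
            raise IndexError(char_index)
        # Jump to the nearest checkpoint at or before the target ...
        offset = self.checkpoints[char_index // self.stride]
        # ... then scan forward over at most stride - 1 code points.
        remaining = char_index % self.stride
        while remaining:
            offset += 1
            if (self.data[offset] & 0xC0) != 0x80:
                remaining -= 1
        return offset
```

The extra storage is one offset per `stride` code points, so a small string
carries a single checkpoint, while a large one avoids the full O(n) scan.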

Also I don't see BitC growing past a single runtime for a LONG time... :-)
I'm not sure it is meant to be the antithesis, as I kind of like the smart
runtimes and the "drop in the file and it always runs" approach, rather than
spending ages hacking, configuring and trying to compile something.

 >
 >> -          Use a complex index type, say a union with char and byte
 >> offset (in theory the compiler should make this just as efficient and
 >> it communicates well between programmer and lib). You could overcome
 >> legacy issues here by setting the index to be the traditional
 >> char/code point rather than byte offset.
 >
 >Mandating such an index on the string itself is more expensive than
 >UCS-4 for small strings, which you have already said is too expensive.

Certainly not on the string itself, but in the standard lib you could use a
64-bit value whose high bit says whether it's a byte index or a char index;
when a char index is set, it is immediately converted to a byte index. It
wouldn't be that expensive, since the indexes exist only for strings that are
actively being worked on. It would just allow most users to use char indexes
and, when needed, the real byte indexes. I'm not really sold on this idea...
it is probably easier to just say they are byte indexes.

 >
 >> -          Use byte offsets and do legacy code on a ToFixedCharArray();
 >> I kind of like this since a lot of C legacy code relies on mutable
 >> strings.
 >
 >What would be nice would be a way to, given an offset to a codepoint
 >or byte, ask for an offset in that neighbourhood that is known to be a
 >character boundary*. That seemed to be the direction implied when
 >talking about distinct iterators for codepoints and characters.

Yes, and I certainly meant this: the runtime would work with byte offsets,
and I see no reason why it would ever return a byte offset that is not on a
char boundary. So find(string, offset) is still a valid API. The main gotcha
is operators, e.g. index+3 needs to be incIndex(index,3); while that is not a
performance issue for an inline method, it is a bit ugly, and I'm not sure
about BitC, but operator overloading can be expensive.
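
To make the byte-offset API concrete, here is a rough sketch of an incIndex
style helper and of the "nearest boundary" query asked for in the quoted text.
The function names are hypothetical, the input is assumed to be valid UTF-8,
and a boundary here means a code point boundary, not a full character
(grapheme) boundary.

```python
def is_boundary(data, offset):
    """True if offset is the start of a code point or the end of data."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


def snap_back(data, offset):
    """Return the nearest code point boundary at or before offset."""
    while not is_boundary(data, offset):
        offset -= 1
    return offset


def inc_index(data, offset, n):
    """Advance a boundary byte offset by n code points."""
    for _ in range(n):
        if offset >= len(data):
            raise IndexError("advanced past end of string")
        offset += 1
        while not is_boundary(data, offset):
            offset += 1
    return offset
```

With these, `inc_index(data, index, 3)` plays the role of index+3, and any
offset handed in from legacy code can be normalised with `snap_back` first.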

 >> Lastly, is it a good idea to support multiple underlying schemes, aside
 >> from legacy support methods like ToFixedCharArray()? Java and .NET have
 >> survived without it, and having a single scheme helps interop. E.g. a
 >> byte code file (.NET assembly or Windows DLL) will work on any machine,
 >> but with different possible internal storage schemes this would not be
 >> possible. If you're saying we leave it up to the lib, that doesn't
 >> really change things, as it just moves the discussion to that point.
 >> Previously the string document made a strong statement that BitC would
 >> use UCS-2 or UCS-4 and hence fixed-width chars; by adding UCS-1, that
 >> part of the document doesn't say much anymore, e.g. "the runtime will
 >> use Unicode and may use byte or char indexing".
 >
 >Converting to a mutable byte array in any encoding always requires a
 >copy, and so does converting back to a BitC string, otherwise we can't
 >ensure immutability. So there is always going to be a cost when doing
 >interop that expects to mutate a string buffer and all you have is a
 >string.

Correct, but as I mentioned, with UTF-8 in many cases you can pass the
initial XML or HTML through with no initial parsing, and you only incur the
conversion when you want it as arrays. So in the common worst case you're no
worse off than before, and in the best case, using UTF-8 directly via byte
indexes, you are better off.
 >
 >
 >On 14 October 2010 13:25, Ben Kloosterman <[email protected]> wrote:
 >> As 32 bit chars is clearly unacceptable
 >
 >For some things. I don't think UCS-4 is a particularly bad choice for
 >an in-memory encoding, it's the one I'd choose, and I'm not even from
 >California.

:-) 
I think the memory overhead is too high in business application and DB
land... but it is a good candidate for the worked-on fixed char arrays. Note
I'm not viewing it from the point of view of Unix apps that read the data and
then finish, but more from memory-hungry business apps, app servers, web
servers, SOA servers, DBs etc., where the strings stay in memory for a long
time while not being used.
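
As a rough illustration of the overhead in question, the sketch below compares
the payload size of the same text in UTF-8, UTF-16 and UCS-4 (encoded here as
UTF-32). The helper name encoded_sizes is made up, and the numbers count only
the encoded bytes, not any object header or length field a runtime would add.

```python
def encoded_sizes(text):
    """Return the payload size in bytes of text under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("customer_id", "\u20ac"):
        print(repr(sample), encoded_sizes(sample))
```

For mostly-ASCII data such as column names or markup, UCS-4 takes four times
the space of UTF-8, which is the cost that matters when strings sit in memory
for a long time.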

Ben 

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
