On 14 October 2010 11:59, Ben Kloosterman <[email protected]> wrote: > - Additional storage for an indexes ( which is bad and complicated)
Not necessarily, it can even be implemented once and then hidden from the user. One of the things I had planned to do for the GNU HURD was to implement file objects on which seeking to a line number was possible due to such an internal index. Once implementation is decoupled from interface, you can experiment with optimisations for the case you are interested in. Sparse indexes on large strings could turn that O(n) into O(log(n)) with a very small constant factor very quickly, and you can treat small strings differently. The question I would ask if that was an option is how you deal with the possibility that some strings have a different representation internally. Since BitC is supposed to be the antithesis of the 'smart runtime' that chooses optimal algorithms and data structures as it sees fit, and since the indirection necessary to support arbitrary implementations could be deemed too costly (whether that indirection occurs via a typeclass dictionary or some compile-time specialisation occurs), it might be better to push these sorts of options out to the user. If I really want an indexed UTF-8 string, I say that, and if I want a rope, I also have to tell you that. > - Use a complex index type say a union with char and byte offset ( > in theory the compiler should make this just as efficient and it > communicates well between programmer and lib ) . You could overcome legacy > issues here by setting the index being the tradition char/code point rather > than byte offset.. Mandating such an index on the string itself is more expensive than UCS-4 for small strings, which you have already said is too expensive. > - Use byte offsets and do legacy code on a ToFixedCharArray() , I > kind of like this since a lot of C legacy code relies on mutable strings. What would be nice would be a way to, given an offset to a codepoint or byte, ask for an offset in that neighbourhood that is known to be a character boundary*. That seemed to be the direction implied when talking about distinct iterators for codepoints and characters. * is that possible with UTF-16 ? > I assume the O(log)n is referring to the fact that in many cases the search > is not from the start. .. That's index time, not search time. Ropes are stored in a tree structure. > Lastly is it a good idea supporting multiple underlying schemes aside from > legacy support methods like ToFixedCharArray() ? Java and .NET have > survived without it and having single schemes helps interop. Eg a >a byte > code file ( .NET assembly or windows dll) will work on any machine but with > different possible internal storage schemes this would not be possible . > If your saying we leave it up to the lib that doesn’t really change things > as it just moves the discussion to that point. Previously the string > document made a strong statement that BitC would use UCS-2 or USC-4 and > hence fixed with chars by adding UCS-1 that part of the document doesn’t say > much anymore…eg the runtime will use Unicode and may use byte or char > indexing. Converting to a mutable byte array in any encoding always requires a copy, and so does converting back to a BitC string, otherwise we can't ensure immutability. So there is always going to be a cost when doing interop that expects to mutate a string buffer and all you have is a string. On 14 October 2010 13:25, Ben Kloosterman <[email protected]> wrote: > As 32 bit chars is clearly unacceptable For some things. I don't think UCS-4 is a particularly bad choice for an in-memory encoding, it's the one I'd choose, and I'm not even from California. -- William Leslie _______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
