Re: [bitc-dev] Unicode and bitc

William Leslie Wed, 13 Oct 2010 20:07:16 -0700

On 14 October 2010 11:59, Ben Kloosterman <[email protected]> wrote:
> -          Additional storage for an indexes ( which is bad and complicated)

Not necessarily, it can even be implemented once and then hidden from the user.

One of the things I had planned to do for the GNU HURD was to
implement file objects on which seeking to a line number was possible
due to such an internal index. Once implementation is decoupled from
interface, you can experiment with optimisations for the case you are
interested in. Sparse indexes on large strings could turn that O(n)
into O(log(n)) with a very small constant factor very quickly, and you
can treat small strings differently.

The question I would ask if that was an option is how you deal with
the possibility that some strings have a different representation
internally. Since BitC is supposed to be the antithesis of the 'smart
runtime' that chooses optimal algorithms and data structures as it
sees fit, and since the indirection necessary to support arbitrary
implementations could be deemed too costly (whether that indirection
occurs via a typeclass dictionary or some compile-time specialisation
occurs), it might be better to push these sorts of options out to the
user. If I really want an indexed UTF-8 string, I say that, and if I
want a rope, I also have to tell you that.

> -          Use a complex index type say a union with char and byte offset (
> in theory the compiler should make this just as efficient and it
> communicates well  between programmer and lib ) . You could overcome legacy
> issues here by setting the index being the tradition char/code point rather
> than byte offset..

Mandating such an index on the string itself is more expensive than
UCS-4 for small strings, which you have already said is too expensive.

> -          Use byte offsets and do legacy code on a ToFixedCharArray()  , I
> kind of like this since a lot of C legacy code relies on mutable strings.

What would be nice would be a way to, given an offset to a codepoint
or byte, ask for an offset in that neighbourhood that is known to be a
character boundary*. That seemed to be the direction implied when
talking about distinct iterators for codepoints and characters.

* is that possible with UTF-16 ?

> I assume the O(log)n is referring to the fact that in many cases the search
> is not from the start. ..

That's index time, not search time. Ropes are stored in a tree structure.

> Lastly is it a good idea supporting multiple underlying schemes aside from
> legacy support methods like ToFixedCharArray() ? Java and .NET  have
> survived without it and having single schemes helps interop.  Eg a >a byte
> code file ( .NET assembly or windows dll) will work on any machine  but with
> different  possible  internal storage schemes this would not be possible .
> If your saying we leave it up to the lib that doesn’t really change things
> as it just moves the discussion to that point.   Previously the string
> document made a strong statement that BitC would use UCS-2 or USC-4  and
> hence fixed with chars by adding UCS-1 that part of the document doesn’t say
> much anymore…eg the runtime will use Unicode and may use byte or char
>  indexing.

Converting to a mutable byte array in any encoding always requires a
copy, and so does converting back to a BitC string, otherwise we can't
ensure immutability. So there is always going to be a cost when doing
interop that expects to mutate a string buffer and all you have is a
string.

On 14 October 2010 13:25, Ben Kloosterman <[email protected]> wrote:
> As 32 bit chars is clearly unacceptable

For some things. I don't think UCS-4 is a particularly bad choice for
an in-memory encoding, it's the one I'd choose, and I'm not even from
California.

-- 
William Leslie

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode and bitc

Reply via email to