Re: [bitc-dev] String encoding, again

William ML Leslie Wed, 16 Mar 2011 21:45:52 -0700

On UCS-2: Sure, nobody likes it, codepoints vs code units, blah blah.
But if you disallow the poorly typed indexing operations we are
talking about here, the user need be none the wiser, as far as I can
see.


Given a strongly-typed index into a string, unless the encoding is
UTF-32, you are going to need some logic to, e.g., detect surrogate
pairs in UCS-2 (strictly speaking, UTF-16) or determine the length of
the character in UTF-8.  We seem to have agreed that the logic for
doing so is cheap enough that it *may* be a worthwhile trade-off for
the reduction in cache usage in common workloads, and that this is
worth benchmarking.
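
To make "some logic" concrete, here is a rough sketch of the two checks
in Python.  The function names are mine, purely for illustration, and
nothing here is proposed BitC API.  Both functions only look at the
first code unit of a sequence, which is why the cost is a handful of
comparisons.

```python
def utf8_seq_len(lead: int) -> int:
    """Bytes in the UTF-8 sequence whose first byte is `lead`.

    Hypothetical helper: it inspects only the lead byte and does not
    validate the continuation bytes that follow.
    """
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                      # 110xxxxx (0xC0/0xC1 are overlong)
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                      # 11110xxx, capped at U+10FFFF
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


def utf16_seq_len(unit: int) -> int:
    """16-bit code units in the UTF-16 sequence that starts with `unit`."""
    if 0xD800 <= unit <= 0xDBFF:
        return 2                      # high surrogate, a low one follows
    if 0xDC00 <= unit <= 0xDFFF:
        raise ValueError("low surrogate cannot start a sequence")
    return 1                          # ordinary BMP code unit
```

Whether these branches cost less than the extra cache traffic of
UTF-32 is exactly the thing that needs measuring.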

Where typesafe indexes (such as iterators in the non-vector case) are
used, which seems to be possible in the examples mentioned (regexp
search, substring), each step or lookup is probably O(1) in the
typical case, so I don't know where either of you is taking this
discussion.
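
As an illustration of what I mean by a typesafe index, here is a
minimal sketch in Python.  Utf8String and Utf8Index are hypothetical
names, and I am assuming the buffer always holds valid UTF-8.  The
index can only be obtained from the string and only moves by whole
characters, so advancing and dereferencing never scan from the start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf8Index:
    """Opaque position (hypothetical type); always on a character boundary."""
    _offset: int


class Utf8String:
    """Sketch of a string that is indexed only through Utf8Index."""

    def __init__(self, text: str):
        self._buf = text.encode("utf-8")   # assumed valid from here on

    def start(self) -> Utf8Index:
        return Utf8Index(0)

    def end(self) -> Utf8Index:
        return Utf8Index(len(self._buf))

    def _width(self, idx: Utf8Index) -> int:
        # Sequence length from the lead byte alone (valid input assumed).
        lead = self._buf[idx._offset]
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def next(self, idx: Utf8Index) -> Utf8Index:
        # O(1): one byte read and an addition.
        return Utf8Index(idx._offset + self._width(idx))

    def char_at(self, idx: Utf8Index) -> str:
        # O(1): decodes at most four bytes.
        o = idx._offset
        return self._buf[o:o + self._width(idx)].decode("utf-8")

    def substring(self, a: Utf8Index, b: Utf8Index) -> str:
        # No search for boundaries; the only cost is copying the bytes.
        return self._buf[a._offset:b._offset].decode("utf-8")
```

A regexp search would hand back a pair of such indexes, and substring
would take them as they are, so the integer offset never leaks to the
user.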

And of course, you will want to allocate the indexes on the stack as
far as iteration goes, but given different instances of the String
typeclass you don't know how large they will have to be.  One way to
deal with this is to make the iteration machinery special, as in Go
or Common Lisp.
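
For reference, Go's "for i, r := range s" walks a UTF-8 string and
yields the byte offset and the decoded rune at each step, with the
cursor kept by the loop construct itself.  A rough Python analogue,
with a made-up function name and assuming valid UTF-8 input, might
look like this:

```python
from typing import Iterator, Tuple


def code_points(buf: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset, character) pairs from a UTF-8 buffer.

    Hypothetical helper, loosely modelled on Go's range-over-string.
    The cursor lives inside the generator, so the caller never names
    or sizes the index type.
    """
    offset = 0
    while offset < len(buf):
        lead = buf[offset]
        # Sequence length from the lead byte (valid UTF-8 assumed).
        if lead < 0x80:
            width = 1
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        yield offset, buf[offset:offset + width].decode("utf-8")
        offset += width
```

Because only the iteration construct knows the cursor's
representation, a compiler is free to choose its size per String
instance and keep it on the stack.  Python obviously does not show
the allocation side of that; the sketch is only about who owns the
cursor.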

Is there some performance issue I'm not considering here?

-- 
William Leslie
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
