So you seem to be making several distinct, good points here. Let's see
if I can summarize:

1. The really important operations are find, substring, and perhaps
regexp-match. For each of these, the cost of a complex structure search to
locate the starting point is OK, and the matching itself is inherently O(n)
or worse.

2. Large strings must be fragmented in any case for GC reasons (my
addition).

3. There just isn't any sensible way to get constant indexing with
reasonable memory consumption.

The root problem is mostly that [3] violates the expectations of us
old-fart western gaijin.

As they say in some parts of China: "Little dogs eat shit." (Roughly: "stuff
happens, what can you do?")

What happens if indexing on a string doesn't return a char, but instead
returns a subrange object? The subrange object has a get method that
returns a pair consisting of the char and a subrange starting at the
next position. This gives us constant-time linear access up to the end
of a run. By breaking the index-of-char assumption up front, it forces
programmers to code to a sensible API, and it leaves us free to revise
the underlying string representation. It also tends to push the initial
indexing step out of the inner loop.
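The proposal might look roughly like the following sketch (all names here are illustrative, not an existing API): get() hands back the current char plus a subrange starting at the next position, so a scan pays the structure search once and then takes constant-time steps.

```python
# Hypothetical sketch of the subrange-returning indexing API.
class Subrange:
    def __init__(self, frag, pos):
        self.frag = frag      # one contiguous fragment of the string
        self.pos = pos        # offset within that fragment

    def get(self):
        """Return (char, Subrange at next position), or None at end."""
        if self.pos >= len(self.frag):
            return None
        return self.frag[self.pos], Subrange(self.frag, self.pos + 1)

def collect(frag, start=0):
    # One indexing step up front, then a constant-time step per char:
    # the initial lookup stays out of the inner loop.
    out = []
    cur = Subrange(frag, start)
    while (step := cur.get()) is not None:
        ch, cur = step
        out.append(ch)
    return "".join(out)
```

Note that each get() returns a fresh subrange rather than mutating in place, which is what lets the representation underneath remain revisable.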

There is a corner case with this: what happens when you get() at the
end of a fragment? That clearly needs to fail in either an in-band or
an out-of-band way, but the check can get very expensive inside a loop.
I think, though, that this is no worse than the vector bounds check
problem that we already have elsewhere.
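The two failure styles could be sketched as follows (names are hypothetical). The in-band form returns a sentinel and costs the caller a branch on every step; the out-of-band form raises, so the check is handled once, outside the loop body, much as with a checked vector access.

```python
# Illustrative comparison of in-band vs. out-of-band end-of-fragment
# failure for get().
class EndOfFragment(Exception):
    pass

def get_inband(frag, pos):
    if pos >= len(frag):
        return None                   # caller must branch on this
    return frag[pos], pos + 1

def get_outofband(frag, pos):
    if pos >= len(frag):
        raise EndOfFragment()         # caller catches once, outside the loop
    return frag[pos], pos + 1

def scan(frag):
    # The loop body carries no visible end check; termination is the
    # exception path.
    out, pos = [], 0
    try:
        while True:
            ch, pos = get_outofband(frag, pos)
            out.append(ch)
    except EndOfFragment:
        return "".join(out)
```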

Note that the subrange objects can be typed according to what their get()
method returns (e.g. code unit or code point), which helps us preserve
well-formedness with minimal checking.
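Typing by the get() result might look like this sketch (illustrative names, assuming a UTF-8 representation): a code-unit range steps over raw bytes, while a code-point range decodes one scalar value per step, so well-formedness is checked only at the decode boundary rather than on every access.

```python
# Hypothetical typed subranges: get() returns a byte for one type and a
# decoded code point for the other.
class CodeUnitRange:
    def __init__(self, data, pos=0):
        self.data, self.pos = data, pos   # data: UTF-8 bytes
    def get(self):
        if self.pos >= len(self.data):
            return None
        return self.data[self.pos], CodeUnitRange(self.data, self.pos + 1)

class CodePointRange:
    def __init__(self, data, pos=0):
        self.data, self.pos = data, pos   # data: UTF-8 bytes
    def get(self):
        if self.pos >= len(self.data):
            return None
        b = self.data[self.pos]
        # The leading byte determines the sequence length in UTF-8.
        n = 1 if b < 0x80 else (2 if b < 0xE0 else (3 if b < 0xF0 else 4))
        # decode() validates well-formedness at this one boundary.
        cp = self.data[self.pos:self.pos + n].decode("utf-8")
        return cp, CodePointRange(self.data, self.pos + n)

def decode_all(data):
    out, cur = [], CodePointRange(data)
    while (step := cur.get()) is not None:
        cp, cur = step
        out.append(cp)
    return out
```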

Reactions?
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev