So you seem to be making several different and good points here. Let's see if I can summarize:
1. The really important operations are find, substring, and perhaps regexp-match. For each of these, the cost of a complex structural search to locate the starting point is acceptable, and the matching itself is inherently O(n) or worse.
2. Large strings must be fragmented in any case for GC reasons (my addition).
3. There just isn't any sensible way to get constant-time indexing with reasonable memory consumption.

The root problem is mostly that [3] violates expectations for old fart western gaijin. As they say in some parts of China: "Little dogs eat shit." (Roughly: "stuff happens, what can you do?")

What happens if indexing into a string doesn't return a char, but instead returns a subrange object? The subrange object has a get() method that returns a pair consisting of the char at the current position and a subrange starting at the next position. This gives us constant-time linear access up to the end of a run. By breaking the index-yields-char assumption up front, it forces programmers to code to a sensible API, and it leaves us free to revise the underlying string representation later. It also tends to push the initial indexing step out of the inner loop.

There is a corner case with this: what happens when you call get() at the end of a fragment? That clearly needs to fail in either an in-band or out-of-band way, and the check can get very expensive inside a loop. But I think this is no worse than the vector bounds-check problem that we already have elsewhere.

Note that the subrange objects can be typed according to what their get() method returns (e.g. code unit or code point), which helps us preserve well-formedness with minimal checking.

Reactions?
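To make the shape of the proposal concrete, here is a minimal sketch in Rust. All the names are hypothetical, invented for illustration: FragString stores the string as fragments (a crude rope, with fragments assumed non-empty), index() does the one-time structural search, and Subrange::get() is the constant-time step that fails out-of-band (via None) at end of string.

```rust
// Hypothetical sketch of the subrange-indexing API; no names here
// come from any real library.

struct FragString {
    fragments: Vec<Vec<char>>, // fragmented storage, GC-friendly
}

#[derive(Clone, Copy)]
struct Subrange<'a> {
    frags: &'a [Vec<char>],
    frag: usize, // index of the current fragment
    pos: usize,  // offset within that fragment
}

impl FragString {
    // The structural search happens once, up front: O(#fragments),
    // not O(1), but pushed out of the inner loop.
    fn index(&self, mut i: usize) -> Option<Subrange<'_>> {
        for (f, frag) in self.fragments.iter().enumerate() {
            if i < frag.len() {
                return Some(Subrange { frags: &self.fragments, frag: f, pos: i });
            }
            i -= frag.len();
        }
        None // out-of-band failure for an out-of-range index
    }
}

impl<'a> Subrange<'a> {
    // Constant time: the char at this position, plus a subrange at
    // the next position. None is the out-of-band end-of-string signal.
    fn get(self) -> Option<(char, Subrange<'a>)> {
        let ch = *self.frags.get(self.frag)?.get(self.pos)?;
        let next = if self.pos + 1 < self.frags[self.frag].len() {
            Subrange { pos: self.pos + 1, ..self }
        } else {
            // Step to the start of the next fragment; get() on the
            // resulting subrange fails if there is no next fragment.
            // (Assumes fragments are non-empty.)
            Subrange { frag: self.frag + 1, pos: 0, ..self }
        };
        Some((ch, next))
    }
}

// A linear traversal written against the API: the "bounds check"
// is the same None test at every step.
fn collect_from(s: &FragString, start: usize) -> String {
    let mut out = String::new();
    let mut cur = s.index(start);
    while let Some(sr) = cur {
        cur = match sr.get() {
            Some((ch, next)) => { out.push(ch); Some(next) }
            None => None,
        };
    }
    out
}

fn main() {
    let s = FragString { fragments: vec![vec!['a', 'b'], vec!['c', 'd']] };
    println!("{}", collect_from(&s, 1)); // prints "bcd"
}
```

A typed variant (code-unit vs code-point subranges) would follow the same shape, with get() returning u16 or char respectively; the type of the subrange then carries the well-formedness guarantee.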
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
