On Wed, Mar 16, 2011 at 4:49 AM, Ben Kloosterman <[email protected]> wrote:
> Basically I think the world was sold a dummy with UCS-2; Windows, C, etc.
> followed, and we would have been better off staying with C char and using a
> standard form of encoding (which we now have with UTF-8), but most OSes and
> languages inherit the baggage.
>
I tend to agree, but given that this particular dummy is now part of our
compatibility baggage, we kind of have to deal with it.
> So you seem to be making several different and good points here. Let's
> see if I can summarize:
>
> 1. The really important operations are find, substring, and perhaps
> regexp-match. For each of these, the cost of a complex structure search to
> locate the starting point is OK, and the matching itself is inherently O(n)
> or worse.
>
> Also note that most replace and find operations take a start index, and use
> of this is very common on mid-to-large strings.
>
Yes. This is the case where O(1) indexing is relevant: finding the starting
point. But it's also a case where you are committed to making a procedure
call anyway, so front-loading the find/replace with a sufficiently brief
traversal to locate the starting point may not be the end of the world.
It's really the benchmark cases that worry me.
> 3. There just isn't any sensible way to get constant indexing with
> reasonable memory consumption.
>
> And the copy overhead is reasonable; strings are probably 30-60% of an
> app's memory, and halving that would have nice benefits for cache usage etc.
>
Not sure why we are talking about copy overheads. How did that get into
this?
> What happens if indexing on a string doesn't return a char, but instead
> returns a subrange object. The subrange object has a get method that returns
> a pair consisting of the char and a subrange starting at the next position.
> This gives us constant time linear access up to the end of a run.
>
> Interesting option. In most cases in Java and .NET either a new or an
> interned object is returned; it's easy on the GC and it's not a performance
> issue. Reusing the same string data would be for a high-performance lib,
> which would very rarely be user-written.
>
I think you misread me. I wasn't referring to string creation. I was
referring to string *indexing*. Ignore UCS sizes for a moment; if you create
a 64Kbyte string, you're basically *done* talking about interactive pause
times (because of copy delays). So even if the unit size is uniform, a large
string must be made of chunks.
At the same time, we don't want to have a procedure call for every single
character indexing operation. So we need to get that kind of loop turned
into something that looks/smells like:
  foreach chunk in string
    while chunk.notEmpty()
      (c, chunk) = (chunk[0], chunk.rest())
If we can get people to adopt this idiom, then we can further leverage
chunks to deal with UCS size change issues.
> There is a corner case with this: what happens when you get() at the end of
> a fragment? That clearly needs to fail in either an in-band or out-of-band
> way, but the check can get very expensive within a loop. But I think this is
> no worse than the vector bounds check problem that we already have
> elsewhere.
>
> Yes, that's a bigger issue.
>
Not really. What I meant is that the two issues are actually one and the
same: detecting the dynamic end of a vector.
> Not sure if you want to ... just use a type class to represent the
> different string types.
>
I'm very hesitant about using static typing too heavily here. Strings are
processed in lots and lots of places, and having all of that code
proliferate in replicated form seems potentially very unhealthy.
shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev