Basically I think the world was sold a dummy with UCS-2; Windows, C, etc.
followed, and we would have been better off staying with C char and using a
standard form of encoding (which we now have with UTF-8), but most OSes and
languages carry the baggage.
So you seem to be making several different and good points here. Let's see
if I can summarize:
1. The really important operations are find, substring, and perhaps
regexp-match. For each of these, the cost of a complex structure search to
locate the starting point is OK, and the matching itself is inherently O(n)
or worse.
Agreed, though concatenation (generating a new object) is the most
important, then find, then substring. I think we really need to see how
immutable strings are used; given the way they are used in .NET, non-O(1)
indexing would not be an issue - but it needs to be measured.
By the way, I see find and replace on medium to large strings working like
this, especially on SandyBridge CPUs:
// Sketch in C with SSE2 intrinsics; the process_* helpers are placeholders.
for (int i = 0; i + 16 <= len; ) {
    __m128i chunk = _mm_loadu_si128((const __m128i *)(str + i));
    if (_mm_movemask_epi8(chunk) == 0) {
        // All 16 bytes are ASCII (no high bits set): handle the block
        // in one step. SandyBridge (AVX) can check 32 bytes, SSE 16.
        process_ascii_block(str + i, 16);
        i += 16;
    } else {
        process_as_escape_sequence(str, &i);  // non-ASCII: slow path
    }
}
This would mean that for the ASCII characters common on web pages and
general English text we only incur the cost of a branch and an SSE
instruction, which is very low overhead for an in-cache memory operation
like walking a string, and may even benefit from prefetching.
Also note that most replace and find APIs take a start index, and use of
this is very common on mid-to-large strings.
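As a tiny illustration of that pattern (the function name is mine, not from
the thread), repeatedly searching from a moving start index in C:

```c
#include <string.h>

/* Illustrative sketch: scan for every occurrence of a character by
   repeatedly searching from a moving start index - the same shape as
   IndexOf(value, startIndex) loops in .NET and Java. */
int count_occurrences(const char *s, char c) {
    int count = 0;
    const char *p = s;
    while ((p = strchr(p, c)) != NULL) { /* find from the current start */
        count++;
        p++;                             /* move the start index past the hit */
    }
    return count;
}
```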
2. Large strings must be fragmented in any case for GC reasons (my
addition).
Yes, this is a good thing; note that constructs like this are common anyway
(XmlDoc, HtmlPage).
3. There just isn't any sensible way to get constant indexing with
reasonable memory consumption.
And the copy overhead is reasonable. Strings are probably 30-60% of an
app's memory; halving that would have nice benefits for cache usage etc.
The root problem is mostly that [3] violates expectations for old fart
western gaijin.
Nope, the whole world was sold on this: look, you can use UCS-2 and all your
problems go away, everyone can use the system. Then something new like the
internet introduces new characters, and UCS-2 is useless.
What happens if indexing on a string doesn't return a char, but instead
returns a subrange object. The subrange object has a get method that returns
a pair consisting of the char and a subrange starting at the next position.
This gives us constant time linear access up to the end of a run.
Interesting option. In most cases in Java and .NET either a new or an
interned object is returned; it's easy on the GC and it's not a performance
issue. Reusing the same string data may be for a high-performance lib,
which would very rarely be user-written.
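A minimal sketch of the subrange idea in C (the names and the in-band
end-of-fragment convention are my assumptions; the real API is open):

```c
#include <stddef.h>

/* Hypothetical sketch: indexing yields a subrange object rather than a
   char. get() returns the code unit at the current position plus a
   subrange starting at the next position. End of fragment fails
   in-band with ch == -1. */
typedef struct {
    const char *data;   /* fragment data */
    size_t pos, len;    /* current position, fragment length */
} subrange;

typedef struct {
    int ch;             /* code unit at this position, or -1 at end */
    subrange next;      /* subrange starting at the next position */
} get_result;

get_result subrange_get(subrange r) {
    get_result res = { -1, r };
    if (r.pos >= r.len)             /* end of fragment: in-band failure */
        return res;
    res.ch = (unsigned char)r.data[r.pos];
    res.next = (subrange){ r.data, r.pos + 1, r.len };
    return res;
}
```

Walking a run is then a chain of get() calls, each constant time, with no
integer index exposed to the caller.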
By breaking the index-of-char assumption up front, it forces programmers to
code to a sensible API, and it leaves the underlying string representation
something that we are free to revise. It also tends to push the initial
indexing step out of the inner loop.
Yes, and it also pushes this stuff to the globalization layer, which is
good (or to a high-performance mutable lib). Consider concatenating an
English string and a right-to-left Arabic one, or languages where toUpper
changes the characters; i.e. the lower layer should not be concerned with
this.
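One concrete instance (my example, not from the thread): Unicode uppercases
U+00DF "ß" to "SS", so case mapping can change the number of code points in
a string and cannot be done in place by a byte-oriented lower layer:

```c
#include <stddef.h>

/* Count code points in a UTF-8 string: every byte that is not a
   continuation byte (10xxxxxx) starts a code point. */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* "straße" is 6 code points; its uppercase "STRASSE" is 7, because
   uppercase("ß") == "SS". Only the globalization layer can know this. */
```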
There are corner cases with this. One is: what happens when you get() at end
of fragment. That clearly needs to fail in either an in-band or out-of-band
way, but the check can get very expensive within a loop. But I think this is
no worse than the vector bounds check problem that we already have
elsewhere.
Yes, that's a bigger issue.
Note that the subrange objects can be typed according to what their get()
method returns (e.g. code unit or code point), which helps us preserve
well-formedness with minimal checking.
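For instance (a sketch; the decoding helper and names are mine), a
code-point-typed get() advances by whole UTF-8 sequences, so iteration over
a well-formed string can never split a multi-byte sequence:

```c
#include <stddef.h>

/* Hypothetical code-point-typed subrange: get() decodes one UTF-8
   sequence and returns a subrange starting at the next code point.
   Assumes well-formed input; cp == -1 signals end of fragment. */
typedef struct { const char *data; size_t pos, len; } cp_range;
typedef struct { long cp; cp_range next; } cp_result;

cp_result cp_get(cp_range r) {
    cp_result res = { -1, r };
    if (r.pos >= r.len)
        return res;                       /* end of fragment */
    unsigned char b = (unsigned char)r.data[r.pos];
    /* sequence length from the lead byte */
    size_t n = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    long cp = b < 0x80 ? b : (b & (0x7F >> n));
    for (size_t i = 1; i < n; i++)        /* fold in continuation bytes */
        cp = (cp << 6) | (r.data[r.pos + i] & 0x3F);
    res.cp = cp;
    res.next = (cp_range){ r.data, r.pos + n, r.len };
    return res;
}
```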
Reactions?
I think it's risky, but the risk/reward curve is firmly on the reward side.
If it works you will significantly reduce memory usage, which may increase
cache hit rates, and reduce work compared to C (Unicode), Java, and C#. You
will also get a performance gain for lightweight code parsing UTF-8, e.g.
web pages and XML; this would be an interesting market. On the con side, if
it doesn't work well you're back to .NET-style UTF-16 or UTF-32 (which is
no big deal), and your high-performance mutable string building/parsing lib
would be a bit more complex.
The fact that it's risky means that, when introduced, the string side is
likely to be more flexible (e.g. an opaque type). Not sure if you want to
reconsider the chains idea in light of this, or just use a type class to
represent the different string types.
Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev