Basically I think the world was sold a dummy with UCS-2; Windows, C#, etc.
followed, and we would have been better off staying with C char and using a
standard form of encoding (which we now have with UTF-8), but most OSes and
languages carry the baggage.

 

I tend to agree, but given the fact that this particular dummy is now part
of our compatibility baggage, we kind of have to deal with it.

 

Yes, but translation is easy and nearly all apps do it; we may actually be
able to avoid translating HTML/XML.

 

 

 So you seem to be making several different and good points here. Let's see
if I can summarize:

1. The really important operations are find, substring, and perhaps
regexp-match. For each of these, the cost of a complex structure search to
locate the starting point is OK, and the matching itself is inherently O(n)
or worse.

Also note that most Replace and Find methods take a start index, and using
it is very common on mid- to large-size strings.

Yes. This is the case where O(1) indexing is relevant: finding the start of
the substring. But it's also a case where you are committed to making a
procedure call anyway, so front-loading the find/replace with a sufficiently
brief traversal to find the start may not be the end of the world.
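For what it's worth, the start-index idiom is already the standard-library shape of this in Java; `indexOf` with a `fromIndex` is exactly the front-loaded traversal (a small sketch, not tied to any particular string representation):

```java
public class FindFrom {
    public static void main(String[] args) {
        String s = "abcabc";
        // indexOf(ch, fromIndex): skip the match at index 1, search from index 2 on
        int i = s.indexOf('b', 2);
        System.out.println(i); // prints 4
    }
}
```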

 

It's really the benchmark cases that worry me.

 

 

Yes, for most apps it will be better; however, for some micro-benchmarks it
will be worse, as those don't work with real-world data (which is often in
UTF-8). There are a few corner cases that are needed in the standard
library. One is that you need some form of mutable string support for
building larger strings, e.g. a StringBuilder. The other case is a string
parsing library/lexer; such a library MAY be better off using UCS-4 char
arrays, but with four times the data I'm not so sure. I'm pretty sure you
don't want to work with the .NET/Java UCS-2/UTF-16 representation, as you
will have the same issues as many libraries: indexing doesn't work on some
Asian characters (those outside the BMP), which is very poor IMHO. Making
code suitable for internationalization is just too hard at the moment.
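A concrete example of the indexing problem with Java's UTF-16 strings: U+20000 is a single CJK character outside the BMP, stored as a surrogate pair, so per-index access sees two code units rather than one character:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD840\uDC00"; // U+20000: one CJK character, two UTF-16 code units
        System.out.println(s.length());                      // prints 2, not 1
        System.out.println(s.codePointCount(0, s.length())); // prints 1
        System.out.println((int) s.charAt(0));               // 0xD840: half a character
    }
}
```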

 

For the benchmark, if you have these methods, you just need to create them
as UCS-2 char arrays, which will lose full Asian character support but will
be OK in the benchmark and similar to .NET and Java. This also works well
for interop.

 

Regarding interop, SSE2/SSE3 can convert ASCII-range UTF-8 to UCS-2/UTF-16
at rates of 8 or 16 characters per instruction.
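At the library level the round trip is already a one-liner; whether the runtime vectorizes the ASCII fast path is implementation-dependent, but e.g. in Java:

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        byte[] utf8 = "hello".getBytes(StandardCharsets.UTF_8); // 5 bytes for ASCII
        String back = new String(utf8, StandardCharsets.UTF_8); // back to UTF-16 chars
        System.out.println(utf8.length + " " + back);           // prints "5 hello"
    }
}
```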

 

 

3. There just isn't any sensible way to get constant indexing with
reasonable memory consumption.

And the copy overhead is reasonable. Strings are probably 30-60% of an
app's memory; halving that would have nice benefits for cache usage, etc.

Not sure why we are talking about copy overheads. How did that get into
this?

 

Not sure. I probably meant that all strings being half the size would be
good for the GC's copying overhead.

 

One question on large strings: since strings are immutable and can't
contain pointers (like images), why not put large ones on the OS heap with
malloc? Since they can't hold references, you can easily check during a
mark phase when they are no longer referenced.

 

What happens if indexing on a string doesn't return a char, but instead
returns a subrange object? The subrange object has a get method that
returns a pair consisting of the char and a subrange starting at the next
position. This gives us constant-time linear access up to the end of a run.
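A minimal sketch of that idea (the names are mine, purely illustrative): indexing hands back a cursor, and advancing it is just a position bump within the chunk, so the traversal stays constant-time per step.

```java
public class CursorDemo {
    // Hypothetical subrange/cursor: get() yields the current char,
    // next() yields a subrange starting at the following position.
    static final class Cursor {
        final char[] chunk; final int pos;
        Cursor(char[] chunk, int pos) { this.chunk = chunk; this.pos = pos; }
        boolean atEnd() { return pos >= chunk.length; }
        char get()      { return chunk[pos]; }
        Cursor next()   { return new Cursor(chunk, pos + 1); }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        for (Cursor c = new Cursor("abc".toCharArray(), 0); !c.atEnd(); c = c.next())
            out.append(c.get());
        System.out.println(out); // prints "abc"
    }
}
```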

Interesting option. In most cases in Java and .NET either a new or an
interned object is returned; it's easy on the GC and it's not a performance
issue. Reusing the same string data would be for a high-performance
library, which would very rarely be user-written.

I think you misread me. I wasn't referring to string creation. I was
referring to string indexing. 

 

I followed. There are only a few .NET methods that return an index, and
that style is cumbersome and introduces coding bugs too: you can't easily
return a char at an index, and note that since the char returned is 16-bit
it can't express some values. Except for large strings, I'm not convinced a
subrange object is better than returning a substring. In most cases you do
the indexing as part of a more complex operation, e.g. Replace, substring,
Trim, pad, etc., all of which return a new string. Internally these methods
do a find and then the rest in one sweep. The point here is that if you
pick a UCS-4 character as the search target, find will locate it and it
will be replaced, but the raw 16-bit char at that position is not useful on
its own.

 

char a = str[index]    should be discouraged and moved off the string
class, e.g. to some runtime methods that work on a UCS-2 or UCS-4 char
array. Note that this, by contrast, is fine:

 

var str2 = str.Replace(index, newstr);

 

I suppose what I'm saying is that strings need to have C#/Java-level
performance and be geared toward good, reliable coding; an alternative
should exist for high performance, e.g. a mutable UCS-4 char array.

 

Ignore UCS sizes for a moment; if you create a 64Kbyte string, you're
basically done talking about interactive pause times (because of copy
delays). So even if the unit size is uniform, a large string must be made of
chunks.

 

 

You still need to handle image blobs.

 

At the same time, we don't want to have a procedure call for every single
character indexing operation. So we need to get that kind of loop turned
into something that looks/smells like:

 

  foreach chunk in string
    while chunk.notEmpty()
      (c, chunk) = chunk[0], chunk.rest()

 

If we can get people to adopt this idiom, then we can further leverage
chunks to deal with UCS size change issues.
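Concretely, under the assumption that a string is stored as a list of char[] chunks, the idiom above amounts to a nested loop with one procedure call per chunk rather than per character (a sketch, not a proposed representation):

```java
import java.util.Arrays;
import java.util.List;

public class ChunkLoop {
    // Count characters across chunks without a per-character method call.
    static int countChars(List<char[]> chunks) {
        int n = 0;
        for (char[] chunk : chunks)                  // foreach chunk in string
            for (int i = 0; i < chunk.length; i++) { // while chunk.notEmpty()
                char c = chunk[i];                   // (c, chunk) = chunk[0], chunk.rest()
                n++;                                 // ... process c ...
            }
        return n;
    }

    public static void main(String[] args) {
        List<char[]> s = Arrays.asList("hello ".toCharArray(), "world".toCharArray());
        System.out.println(countChars(s)); // prints 11
    }
}
```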

 

It only needs to be adopted for large strings in high-performance code.
Most strings are small, e.g. a line of a file. If we are saying we don't
have large strings, I don't see the issue; as mentioned with the GC above,
large strings should be discouraged. That discouragement alone may be
beneficial in string processing.

 

 

Not sure if you want to ... just  use a type class to represent the
different string types.

I'm very hesitant about using static typing too heavily here. Strings are
processed in lots and lots of places, and having all of that code
proliferate in replicated form seems potentially very unhealthy.

 

 

Are you stating it would stress the type system? Optimizations would remove
the unused code, and the libraries will mostly use a single piece of string
code. (Note I know nothing about how type classes are implemented in the
back end.) A type class would make it a less risky option.

 

Anyway, we won't know until we do it. I'm currently thinking it's best to
use strings as above (expect my best option to change in 10 minutes) and
encourage small strings, but also have some library functions that work on
UCS-2 and UCS-4 mutable char arrays. 99% of people will just use string,
but something like a parser may read the document, convert it to UCS-2 or
UCS-4 (depending on Chinese support/performance needs), process it, and
then convert it back, similar to what C does now. I'm not convinced the
array library code is needed, as going from 8 bits to 32 is a big increase,
and some parsers may use regular expressions instead.
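For reference, that convert-process-convert-back round trip through a UCS-4 (code point) array is already expressible in Java today, at the 4x size cost mentioned:

```java
public class CodePointArray {
    public static void main(String[] args) {
        String s = "a\uD840\uDC00b";            // 'a', U+20000, 'b'
        int[] ucs4 = s.codePoints().toArray();  // UCS-4 view: one int per character
        System.out.println(ucs4.length);        // prints 3 (s.length() is 4)
        // ... process ucs4 in place, then convert back:
        String back = new String(ucs4, 0, ucs4.length);
        System.out.println(back.equals(s));     // prints true
    }
}
```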

 

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
