> Shap wrote 
>So the conclusion on all this appears to be that there simply isn't a good
choice of string type, which I think we already knew. 

In scenarios like this it's best to go lightweight, offer maximum
flexibility, and leave it to the standard lib (or a competing standard
lib) to sort out.


>William asks, correctly: what are my objectives for strings?  I have three:

> 0. The type string is the type assigned to string literals.
> 1. String indexing should return a code point.
> 2. String traversal (get next character) should be O(1).
> 3. String fetch code generation should not require a test and
>    conditional branch.

>After all of our discussion, I'm sorely tempted to just delete strings from
the type system altogether on the grounds that they are always the wrong
thing to use. Unfortunately we cannot do that because of condition [0].

Far too radical for me, as strings help developers write coherent libs;
though if you're saying not to have it in the type system but instead to
put, say, a string type class in the standard lib, then I do agree.


>But if a string is always the wrong thing to use, then it seems to me that
at least it should produce naively sensible results. This seems to lead to
two sensible outcomes:

>Definition 1: A string is a representation paired with an encoding as
fixed-length units. It is left to the programmer to perform any encoding
conversion(s) required to obtain the indexing and sequencing properties
desired. The job of the library is to support those conversions.

>Definition 2: A string is a thing that can be used by a naive programmer to
generate unsurprising outcomes for small to medium-sized inputs. This would
argue for UCS32.


>I'm sure this is badly broken somehow, but it's at least consistent and
explicable.

>So over to Ben, who will explain why I'm still being incredibly dense (and
thank God someone here has a handle on this stuff...


I don't really have a full handle on this either; it's not easy. Look at
.NET: at that time UCS-2 was all the rage, and they are stuck with UTF-16
as a result, meaning all C# apps convert from UTF-16 to UTF-8 for web
pages, XML, you name it.

Yes, I prefer definition 1, but if the type defines the encoding I see no
reason for it to be fixed-index at the higher level (though obviously it is
at the lower level).


> Shap wrote:
>Now to answer your last question, I think my real concern here is to get
O(1) time iteration through a string, which isn't the same thing as O(1)
random access at all. But that appears to require building some sort of
"string iterator" object whose use is known to the code generator. That cure
seems potentially worse than the disease.


A quick reply 

Maybe we are taking the wrong approach here. Let's look at how strings are
used, and when indexing is used. The compiler parsing/lexing you are
probably looking at is not common string code, and I would argue such code
may be better handled by a specialist lib rather than by the default string
handling.

In most cases, code works with very small strings and simply concatenates
them (especially if you go with the stream-type output you discussed
earlier, though a C/C#-style format string with varargs can be stored as
separate strings and concatenated to produce a single large string).
Actually, most developer indexing of strings should be avoided: it is
nearly always buggy, hard-to-maintain code, and in most cases it has been
replaced with replace-type functions returning a new string. Since the
allocation cost is relatively high compared to a non-O(1) index, I don't
see the issue here. Note that in C#, anyway, replace and index operations
are nearly always on a substring, not a char.

Regarding O(1) iteration through the string: in most cases this is not
needed, as you simply treat it like we did in C, i.e. bytes with a higher
encoding. You certainly don't need a string iterator object; that would be
too expensive, and type classes should ensure the right function is called,
though in the worst case a conditional would not be bad.

Also, a scheme like UTF-8 is specifically designed to make multi-byte chars
easy to detect, and you can use SSE3 instructions to help, scanning 16 or
now 32 bytes (YMM) in one instruction; if you find no multi-byte sequences,
you can treat the string as O(1)-indexable.
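That scan can be sketched scalar-fashion (a SIMD version would test a whole
16- or 32-byte block's sign bits in one instruction; the function name here
is mine):

```python
def is_ascii(buf: bytes, chunk: int = 16) -> bool:
    """Return True if no byte has the high bit set, i.e. the buffer
    contains no UTF-8 multi-byte sequences and so can be indexed O(1).
    The chunking mirrors how a SIMD loop would walk the buffer."""
    for i in range(0, len(buf), chunk):
        # A SIMD version would OR/test the whole block in one op;
        # here we check the high bit of each byte in the chunk.
        if any(b & 0x80 for b in buf[i:i + chunk]):
            return False
    return True
```

A string type could cache this flag, choosing the O(1) fast path for the
(very common) all-ASCII case.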

Now, if we look at conversion costs, you can convert from UTF-8 to UTF-16
or UCS-4 very cheaply for the most common characters (again, SSE3 YMM can
convert 16 or 8 chars in one instruction, covering most strings). This
means a special high-performance lib which uses lots of indexing may
convert to a more appropriate type. The fact that most original documents
are UTF-8, and are parsed a line at a time by such libs, suggests it may
not be bad to use UTF-8 internally (whereas C#, C and Java always pay the
conversion cost).
Also, regarding conversion: in .NET a lot of this work is done on char
arrays, and ToCharArray() should probably always return UCS-4.
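The trade-off can be made concrete: indexing UTF-8 directly is O(n), while
a one-shot conversion to a UCS-4 array makes every later index O(1). A
rough Python sketch (function names are mine; the ToCharArray analogy is
from the paragraph above):

```python
def byte_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th code point in a UTF-8 buffer: an O(n)
    scan that skips continuation bytes (those of the form 10xxxxxx)."""
    i = 0
    for _ in range(n):
        i += 1
        while i < len(buf) and (buf[i] & 0xC0) == 0x80:
            i += 1
    return i

def to_ucs4(buf: bytes) -> list:
    """Pay the O(n) decode once; afterwards every index is an O(1)
    full-code-point lookup (what a UCS-4-returning ToCharArray
    would give an index-heavy lib)."""
    return [ord(c) for c in buf.decode("utf-8")]
```

If the string is indexed fewer times than its length, the O(n) scan wins;
past that, the one-time conversion pays for itself.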
 
A lot of high-perf string parsing libs with indexes today have issues with
Asian chars. I used one as a user (PHP or Python at the back end) and it
failed for some Chinese chars, resulting in a mess. But there may be a case
for using UCS-2 where you know you won't have Asian characters, as it would
be faster.

Anyway, summing up, I prefer this: "a string is an indexable collection of
characters with an encoding".

The default, IMHO, should be UTF-8, which gives fast reading and writing,
fast concatenation, and reasonably fast replace. Indexes are supported but
maybe O(n), though libs will use the nearest known char position to reduce
the impact (which is done anyway in most libs). For cases where lots of
indexing / mutable work is needed, conversion to UCS-2 or UCS-4 (for Asian
text) should be used, maybe based on internationalization settings. Note
this conversion is effectively free, as it is done in C, C# and Java
anyway, and it's a good idea for the standard lib NOT to work with mutable
strings anyway.

A special lib supporting fast character processing on mutable strings can
just check what format the string is in and convert it if necessary; this
is just an expanded version of the StringBuilder needed anyway to build
large strings efficiently.
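Such an expanded builder might look like this sketch (class and method
names are hypothetical): pieces are accumulated cheaply and joined once,
and the O(1)-indexable form is built lazily, only when character-level work
is actually requested:

```python
class StringBuilder:
    """Sketch: accumulate pieces cheaply; convert to an indexable
    code-point array only when index-heavy work is requested."""

    def __init__(self):
        self._parts = []
        self._ucs4 = None          # lazily built O(1)-indexable form

    def append(self, s: str) -> "StringBuilder":
        self._parts.append(s)
        self._ucs4 = None          # invalidate the cached array
        return self

    def build(self) -> str:
        # Join once: O(total length), instead of quadratic repeated "+".
        return "".join(self._parts)

    def char_at(self, i: int) -> str:
        if self._ucs4 is None:     # convert on first indexed access
            self._ucs4 = list(self.build())
        return self._ucs4[i]
```

The conversion check lives in one place, so ordinary append-and-emit code
never pays for it.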

I am well aware no common language has made this choice, but:
• Most of those languages originated when UCS-2 was the grand solution, or
are tied to runtimes that did.
• The move toward immutable strings pushes the argument for less indexing.
• UTF-8's growth has been recent and driven by practical experience;
before that, everyone was saying UTF-16 would rule.
• Note that ASCII, with the .NET-style scheme of handling multi-byte chars
in specialist libs for the cultures that need them, was the way things were
done in C before, when performance was more at a premium. I.e., as you
said, at the lower level a string is a set of bytes with a length and an
encoding, and string searches are a higher-level concern involving
internationalization etc. Consider cultural issues, e.g. Turkish: there
exists a capital "I with a dot" character (\u0130), which is the capital
version of i; similarly, there is a lowercase "i without a dot" (\u0131),
which capitalizes to I. This requires more complex searching and indexing.
• I wouldn't worry about performance in Chinese: they are already blessed
with words that are half our size, and they suffer a lot from expensive
ASCII in UCS-4. UCS-4 is not popular there; most use a custom encoding on
top of ASCII. (In Chinese, anyway, index lookups are almost always string
searches, very rarely char lookups.)
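The Turkish point can be made concrete: the default locale-free case
mapping sends 'i' to 'I', which is wrong for Turkish, so a locale-aware
version must special-case the two i's (the function name below is mine):

```python
def upper_turkish(s: str) -> str:
    """Turkish-aware uppercasing: 'i' maps to 'İ' (U+0130, capital I
    with a dot) and dotless 'ı' (U+0131) maps to 'I', unlike the
    default Unicode one-size-fits-all mapping."""
    # Handle the two Turkish-specific letters first, then apply the
    # ordinary mapping to everything else.
    return s.translate({ord("i"): "\u0130", ord("\u0131"): "I"}).upper()
```

This is exactly the kind of rule that argues for keeping case-insensitive
search and comparison in a higher, internationalization-aware layer rather
than in the byte-level string type.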

Even if this scheme is wrong in the long term (the risk is conversion being
too frequent, but if a string is converted fewer than about two times you
will probably still be better off with UTF-8), you will do great in
benchmarks, which will help language adoption, and the default could be
changed to UCS-2/UTF-32 later, depending on internationalization settings.
And the fact that we had higher encodings layered on ASCII before
(especially in Asia) shows it worked well.

Ben


_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
