Re: [bitc-dev] String encoding, again

Ben Kloosterman Tue, 22 Mar 2011 18:47:22 -0700


 

So here, concretely, is what I'm contemplating:

 

1. Strings will have unspecified internal representation, but well-formed
strings will contain code points, not code units.

2. There will be some form of string accessor object or set of functions.
The general idea is that you specify the index of the first desired code
point and the length, and get back a StringAccessor containing those code
points.

 

 

In most use cases returning a substring (even a single char at an index) is
sufficient. I think the accessor is good, but does it fill enough of the gap
between getting a substring and getting the underlying array to be
justified? (I think so.)

 

As I mentioned, in C# the use of an index leads to difficult-to-maintain
code, but it's quite interesting that originally they expected people to use
ToCharArray and possibly unsafe methods for heavy indexing. Then they added
indexing to string, and the end result is that people just use string and
the performance is nearly always good enough. BitC has a more extreme
performance requirement, but if we deny a char index (say we only support an
index returning a string), people will use arrays/vectors more for such
work. (We also have the issue that any char we return must be UCS-4 and
hence will frequently require conversion.)
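
To make the "index returns a string, not a char" idea concrete, here is a
rough Python sketch. Everything in it is an assumption for illustration: the
OpaqueString name, the get_sub method, and the choice of UTF-8 as the hidden
storage are mine, not anything BitC defines.

```python
class OpaqueString:
    """Hypothetical string type; callers never see the storage (UTF-8 here)."""

    def __init__(self, text: str):
        # Internal representation is an assumption; it could equally be UTF-16.
        self._utf8 = text.encode("utf-8")

    def _byte_offset(self, cp_index: int) -> int:
        # Count lead bytes only; 0b10xxxxxx bytes continue a previous code point.
        seen = 0
        for offset, byte in enumerate(self._utf8):
            if (byte & 0xC0) != 0x80:
                if seen == cp_index:
                    return offset
                seen += 1
        if seen == cp_index:
            return len(self._utf8)  # one past the last code point
        raise IndexError("code point index out of range")

    def get_sub(self, index: int, length: int) -> "OpaqueString":
        # Index and length are in code points, never in code units.
        start = self._byte_offset(index)
        end = self._byte_offset(index + length)
        result = OpaqueString("")
        result._utf8 = self._utf8[start:end]
        return result

    def __str__(self) -> str:
        return self._utf8.decode("utf-8")


s = OpaqueString("na\u00efve \U0001F600!")
one = s.get_sub(6, 1)  # a one-code-point string, not a char
```

Note the linear scan in _byte_offset: with a variable-width representation
each lookup costs O(n), which is the performance concern that would push
people towards arrays/vectors for index-heavy work.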

 

 

 

 

3. The string accessor notion can be extended to other things such as code
units.

 

The question then becomes: why shouldn't we simply decide that vectors make
perfectly fine holders of code units, and Strings are the means for
representing code points? The only objection I can think of is that this is
not (regrettably) how various native string representations operate (e.g.
.Net).

 

 

This was the original two-string-type design I suggested, as it's
lightweight and simple and achieves both goals (a fast array and a
programmer-friendly string). It would consist of a char array, whose use is
discouraged but which a few functions use, and a widely used string (which
may wrap the char array and has a toArray method). The argument at the time
against this was that it pollutes the lib with 2 different string types, but
the reality is that they have very different roles. You're not going to see
many methods on the char array, as it's mainly do-it-yourself stuff.
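
A minimal Python sketch of that split, under my own assumptions: the Str
class, its length and to_array methods, and the use of UTF-8 code units are
hypothetical names and choices, picked only to show the two roles.

```python
class Str:
    """Hypothetical friendly string: wraps a code-unit array, counts code points."""

    def __init__(self, units: bytes):
        units.decode("utf-8")        # reject ill-formed input up front
        self._units = bytes(units)   # private, immutable copy of the code units

    def length(self) -> int:
        # Length in code points: skip UTF-8 continuation bytes (0b10xxxxxx).
        return sum(1 for b in self._units if (b & 0xC0) != 0x80)

    def to_array(self) -> bytearray:
        # Hand out a mutable copy for do-it-yourself work; the string is untouched.
        return bytearray(self._units)


s = Str("h\u00e9llo".encode("utf-8"))
units = s.to_array()      # the raw array: 6 code units for 5 code points
units[0] = ord("H")       # array-level edit, entirely the caller's responsibility
shouted = Str(units)      # re-validated when wrapped back into a string
```

The array side has almost no methods of its own, which is the point: it is
just a vector, and anything built from it gets validated on the way back
into a string.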

 

.NET interop shouldn't be an issue: it's just a UTF-16 string, and the
strange indexing is not our problem. Since string is an opaque type, we
should be able to have it use UTF-16 internally if the program uses a lot of
interop. Note here again that disallowing char ch = string[x] but allowing
str ch = string.getSub(x, 1) or a StringAccessor prevents any issues with
people trying to use .NET strings in BitC the way .NET does.
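
The "strange indexing" can be shown with a small Python sketch. The helper
names (utf16_units, is_surrogate, get_sub) are hypothetical; Python's own str
is indexed by code point, so it stands in for the BitC string here, while the
list of UTF-16 code units stands in for what a code-unit index would see.

```python
def utf16_units(text: str) -> list:
    # Encode to UTF-16 and split into 16-bit code units.
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_surrogate(unit: int) -> bool:
    # Surrogate code units are not characters on their own.
    return 0xD800 <= unit <= 0xDFFF


def get_sub(text: str, index: int, length: int) -> str:
    # Code-point substring; Python slicing already works in code points.
    return text[index:index + length]


text = "a\U0001F600b"             # three code points, one outside the BMP
units = utf16_units(text)         # four UTF-16 code units
half = units[1]                   # a code-unit index lands on half a pair
whole = get_sub(text, 1, 1)       # a code-point substring returns all of it
```

With only getSub-style access exposed, the half-a-pair case cannot be
expressed at all, whatever the string uses internally.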

 

 

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
