On Tue, Mar 22, 2011 at 6:46 PM, Ben Kloosterman <[email protected]> wrote:
> As I mentioned, in C# the use of index leads to difficult-to-maintain
> code, but it's quite interesting that originally they expected people to
> use ToCharArray and possibly unsafe methods for lots of indexing...

That's a bit of history that I had not known, and it's very useful.

> ...but they added the indexing to string, and the end result is that
> people just use string, and the performance is normally always good
> enough.

Umm. Ben? Any chance that this is because they defined char as UCS16 and in
practice always implement String using the same heap data structure that
Vector<ucs16> uses? That is: indexing performance is good enough because
(a) it is constant time, and (b) its semantics are broken in exactly the
way we are hoping to avoid.

> For BitC, though, we have a more extreme performance requirement, but if
> we deny a char index (say we only support an index returning a string),
> people will use arrays/vectors more for such work (we also have the issue
> that any char we return must be UCS-4 and hence frequently requires
> conversion).

I don't know whether our performance issues are more extreme or not.

Maybe we're trying to solve the wrong end of the problem here. How hard can
it really be to get China to officially adopt a Western language? :-)

> .NET interop shouldn't be an issue; it's just a UTF16 string...

When I looked at this a year ago, I was appalled to learn that this just
isn't true. .NET strings are straddling the fence like crazy. While the
string representation is not defined, all implementations of .NET use
vector<UCS16>. Character indexing on a String returns a UCS16 unit that may
or may not be a well-formed Unicode code point. All of the interesting
string->string operations, however, are now attempting to operate on code
points. Substring is a case in point. In classic Microsoft form, the error
conditions when these operations get handed bad input are not really
specified.
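For what it's worth, Java's String has the same code-unit indexing semantics
being described for .NET here, so a small Java sketch (my illustration, not
from this thread) can show concretely what "constant time but broken
semantics" means: indexing is O(1), but charAt returns a UTF-16 code unit
that may be only half of a surrogate pair rather than a code point.

```java
public class CodeUnitIndexing {
    public static void main(String[] args) {
        // U+1F600 (GRINNING FACE) lies outside the BMP, so it occupies
        // two UTF-16 code units (a surrogate pair) inside the string.
        String s = "\uD83D\uDE00";

        // Indexing is constant time, but it indexes code units, not
        // code points: length() reports 2 for this single character.
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1

        // charAt(0) hands back a bare high surrogate -- not a
        // well-formed Unicode code point on its own.
        char c = s.charAt(0);
        System.out.println(Character.isHighSurrogate(c));    // true

        // codePointAt does the pairing and recovers U+1F600.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```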
The definition of a .NET string does not, in fact, guarantee that the
sequence of UCS16 code units constitutes a well-formed code point sequence,
and there are quite a number of operations whose defined error checks are
insufficient to guarantee the code-point well-formedness of Strings.
Notwithstanding this hole in the specification, there are many *other*
parts of the specification that appear to rely implicitly on the String
well-formedness constraint.

Of course, to extract those statements I had to scan about a billion pages
of Microsoft Standardese. It's altogether possible that I missed something
somewhere...

shap
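The same hole exists in Java, so a Java sketch (my own, offered as an
analogy rather than a statement about the .NET spec) can demonstrate how an
operation like substring manufactures an ill-formed string without any
error check firing: the damage only surfaces later, at encode time.

```java
import java.nio.charset.StandardCharsets;

public class IllFormedString {
    public static void main(String[] args) {
        // "a" followed by U+1F600, stored as a surrogate pair.
        String s = "a\uD83D\uDE00";

        // substring counts UTF-16 code units, so index 2 falls in the
        // middle of the pair. No exception is raised; the result is a
        // string whose code-unit sequence is NOT a well-formed code
        // point sequence (it ends in a lone high surrogate).
        String bad = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(bad.charAt(1))); // true

        // The ill-formedness surfaces downstream: the lone surrogate
        // cannot be encoded as UTF-8, so the encoder silently
        // substitutes a single replacement byte ('?').
        byte[] utf8 = bad.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2: 'a' plus the replacement
    }
}
```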
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
