Re: [bitc-dev] String encoding, again

Ben Kloosterman Wed, 23 Mar 2011 17:27:36 -0700

On Tue, Mar 22, 2011 at 6:46 PM, Ben Kloosterman <[email protected]> wrote:

As I mentioned, in C# the use of indexing leads to difficult-to-maintain
code, but it's quite interesting that originally they expected people to use
ToCharArray, and possibly unsafe methods, for heavy indexing...


That's a bit of history that I had not known, and it's very useful.

It's quite interesting how it evolved, especially the way it's used; maybe
worth a thread comparing some other languages. Unsafe code and arrays were
expected to be used often but are very rarely used. Almost from the start
most people used collections: ArrayList and later the generic, type-safe
List (it's quite nice how easy it is to substitute collections when needed).
LINQ is becoming very popular, and people eventually seem to gravitate to
the inversion of control pattern (which is almost an anti-pattern but helps
simplify larger apps).

 


 

  ...but they added indexing to string, and the end result is that people
just use string, and the performance is nearly always good enough.


Umm. Ben? Any chance that this is because they defined char as UCS16 and in
practice always implement String using the same heap data structure that
Vector<ucs16> uses? That is: indexing performance is good enough because (a)
it is constant time, and (b) its semantics are broken in exactly the way we
are hoping to avoid.

 

Indexing is rarely used. It's now the case that people use Replace a LOT,
even when modifying a single char, which is inefficient since it returns a
new string, but it doesn't matter in the scheme of things. String.Format is
also popular, and hence the most important operations by far are Find (used
by Replace) and concatenation.

 

Older code did more indexing, but machines are so fast now that it's rarely
needed, and indexing results in lots of nasty bugs due to foreign-language
issues (4-byte chars break), unexpected locations returned by the find, and
then offset calculations going awry.
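
To make the offset problem concrete, here is a minimal sketch in Python
rather than C#. It simulates .NET-style indexing by splitting a string into
UTF-16 code units; the helper name utf16_units is made up for illustration
and is not part of any library.

```python
import struct

def utf16_units(text):
    """Split text into 16-bit UTF-16 code units (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), then 'b'
text = "a\U0001D11Eb"
units = utf16_units(text)

code_point_offset = text.index("b")       # position counted in code points
code_unit_offset = units.index(ord("b"))  # position counted in UTF-16 units

print(len(text), len(units))
print(code_point_offset, code_unit_offset)
```

The two offsets drift apart by one for every 4-byte character before the
match, so arithmetic that mixes a find result with a fixed character count
can land in the wrong place.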

For BitC, though, we have a more extreme performance requirement, but if we
deny a char index (say we only support an index returning a string), people
will use arrays/vectors more for such work (we also have the issue that any
char we return must be UCS-4 and hence will frequently require conversion).


I don't know whether our performance issues are more extreme or not.

 

In C#, some of these things, like the XML parser and maybe even the regular
expression object, are written in C. C# is a user application language that
does stretch to OS and drivers but shows strain when it does.

Maybe we're trying to solve the wrong end of the problem here. How hard can
it really be to get China to officially adopt a Western language? :-)
 

  .NET interop shouldn't be an issue; it's just a UTF-16 string...


When I looked at this a year ago, I was appalled to learn that this just
isn't true. .NET strings are straddling the fence like crazy.

While the string representation is not defined, all implementations of .NET
use vector<UCS16>. Character indexing on a String returns a UCS16 unit that
may or may not be a well-formed Unicode code point.
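
As a rough illustration of that last point, the sketch below (Python, not
.NET) takes a single character outside the BMP, indexes its UTF-16 form by
code unit, and checks what comes back. The helpers utf16_units and
is_surrogate are hypothetical names used only for this example.

```python
import struct

def utf16_units(text):
    """Split text into 16-bit UTF-16 code units (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def is_surrogate(unit):
    """True if the 16-bit unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF

# One code point, U+1D11E, stored as two UTF-16 code units.
units = utf16_units("\U0001D11E")
first = units[0]  # what a code-unit index at position 0 would hand back

# A lone surrogate is not a valid character on its own, so a strict
# UTF-8 encoder refuses it.
lone_is_encodable = True
try:
    chr(first).encode("utf-8")
except UnicodeEncodeError:
    lone_is_encodable = False

print(hex(first), is_surrogate(first), lone_is_encodable)
```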

Correct, but this does not concern us. BitC will output a UTF-16 string; in
BitC we index it the BitC way, and in C# they index it the poor .NET way.

All of the interesting string->string operations, however, are now
attempting to operate on code points. Substring is a case in point. In
classic Microsoft form, the error conditions when these operations get
handed bad input are not really specified.

Correct; the use of index is just bad programming for most user apps, which
is why I'm sort of saying to remove it, to raise the barrier and force
developers to use arrays if they want it.
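
One possible shape for this, sketched in Python purely for illustration: a
string type with no character index, where anyone who wants indexing must
first copy the contents out to an array of code points. The class name Text
and its methods are hypothetical and are not a proposed BitC API.

```python
class Text:
    """Hypothetical string type that refuses per-character indexing."""

    def __init__(self, value):
        self._value = value

    def __getitem__(self, index):
        # Deliberately unsupported: indexing has to go through an array.
        raise TypeError("Text is not indexable; use to_code_points()")

    def to_code_points(self):
        """Copy the contents out as a list of code points (UCS-4 values)."""
        return [ord(ch) for ch in self._value]

    def replace(self, old, new):
        """Return a new Text; the original is left unchanged."""
        return Text(self._value.replace(old, new))

    def __str__(self):
        return self._value


t = Text("a\U0001D11Eb")
points = t.to_code_points()  # explicit, indexable array

try:
    t[1]
    indexing_allowed = True
except TypeError:
    indexing_allowed = False

print(points, indexing_allowed, str(t.replace("b", "c")))
```

The copy in to_code_points is the price of the barrier: the common
operations (find, replace, concatenation) stay on the string, and only code
that really needs positions pays for the array.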

 

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
