On Wed, Aug 28, 2013 at 6:25 AM, Bennie Kloosteman <[email protected]> wrote:

> ...The fact that 90% of strings are 0x00 0x?? 0x00 0x?? etc. seems
> monumentally wasteful even for foreign languages...
>

That's an amazingly western-centric view, and it's flatly contradicted by
actual data.

I'm in favor of UTF-8 strings, and also of "chunky" strings in which
sub-runs are encoded using the most efficient encoding for the run. Those
are a lot harder to implement correctly than you might believe.
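
To make "harder than you might believe" concrete, here is a minimal sketch
of what a chunky string might look like. This is purely illustrative: every
name is invented, and it punts on everything interesting (mutation,
re-encoding, merging runs).

using System.Collections.Generic;

// Hypothetical sketch only; not any runtime's actual representation.
sealed class ChunkyString
{
    // A run of characters stored in the narrowest encoding that holds it:
    // one byte per char (ASCII/Latin-1 run) or two bytes per char (UTF-16 run).
    struct Chunk
    {
        public byte[] Bytes;    // raw storage for the run
        public bool IsWide;     // false = 1 byte/char, true = 2 bytes/char
        public int StartIndex;  // cumulative character index of the first char
        public int Length { get { return IsWide ? Bytes.Length / 2 : Bytes.Length; } }
    }

    readonly List<Chunk> _chunks = new List<Chunk>();

    // Append one run, encoded as narrowly as its contents allow.
    public void Append(string run)
    {
        bool wide = false;
        foreach (char ch in run) if (ch > 0xFF) { wide = true; break; }
        var bytes = new byte[run.Length * (wide ? 2 : 1)];
        for (int j = 0; j < run.Length; j++)
        {
            if (wide) { bytes[2 * j] = (byte)run[j]; bytes[2 * j + 1] = (byte)(run[j] >> 8); }
            else bytes[j] = (byte)run[j];
        }
        int start = _chunks.Count == 0
            ? 0
            : _chunks[_chunks.Count - 1].StartIndex + _chunks[_chunks.Count - 1].Length;
        _chunks.Add(new Chunk { Bytes = bytes, IsWide = wide, StartIndex = start });
    }

    // Indexing is no longer O(1): binary-search the chunk table for the run
    // containing i (O(log #chunks)), then do constant-time work inside it.
    public char this[int i]
    {
        get
        {
            int lo = 0, hi = _chunks.Count - 1;
            while (lo < hi)
            {
                int mid = (lo + hi + 1) / 2;
                if (_chunks[mid].StartIndex <= i) lo = mid; else hi = mid - 1;
            }
            Chunk c = _chunks[lo];
            int off = i - c.StartIndex;
            return c.IsWide
                ? (char)(c.Bytes[2 * off] | (c.Bytes[2 * off + 1] << 8))
                : (char)c.Bytes[off];
        }
    }
}

Even this toy version shows where the trouble starts: every append has to
keep the cumulative StartIndex table consistent, adjacent runs want to be
merged or re-encoded, and the bookkeeping can easily cost more than the
narrow encoding saves.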

The problem with UTF-8 strings is that they do not index efficiently.
Because code points are variable-width, a naive s[i] becomes an O(n) scan,
and even with an auxiliary offset index it is at best an O(log n) operation
rather than an O(1) operation. For sequential access you can fix that with
an iteration helper class, but not all access is sequential. The same
problem exists for strings with mixed internal formats.
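
For concreteness, the sequential case can be handled by a cursor that
remembers its byte offset. A minimal sketch (invented names; it assumes
well-formed UTF-8 starting on a code-point boundary):

// Minimal sketch, not Mono's or anyone's actual implementation.
struct Utf8Cursor
{
    readonly byte[] _utf8;
    int _byteOffset;              // current position in bytes
    public int CodePointIndex;    // current position in code points

    public Utf8Cursor(byte[] utf8)
    {
        _utf8 = utf8; _byteOffset = 0; CodePointIndex = 0;
    }

    // Decode the code point under the cursor and advance: O(1) per call,
    // so a full sequential pass is O(n) total, same as for UTF-16.
    public bool TryNext(out int codePoint)
    {
        if (_byteOffset >= _utf8.Length) { codePoint = 0; return false; }
        byte b = _utf8[_byteOffset];
        int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        codePoint = (len == 1) ? b : b & (0x7F >> len);
        for (int k = 1; k < len; k++)
            codePoint = (codePoint << 6) | (_utf8[_byteOffset + k] & 0x3F);
        _byteOffset += len;
        CodePointIndex++;
        return true;
    }
}

Random access is exactly what this cursor cannot give you: reaching code
point i from scratch still means scanning, or maintaining the kind of
auxiliary offset index mentioned above.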


> Saving pretty much 60% of the data moved around or compared for most
> string operations is a huge win over C# and Java. Most web sites are
> UTF8-ASCII, and even foreign web sites are 80-90% ASCII. Think
> middle-tier performance: JSON, XML, etc. Maybe enough to lift Mono over
> those products.
>

The proportion of in-heap string data has grown since I last saw
comprehensive measurements, and for applications like DOM trees it is a
big part of the total live working set. But data copies are *not* the
dominant performance issue in such applications. Data indexing is. This is
why IBM's ICU library is so important: it reconciles all of the
conflicting definitions of string indexing and implements the classes that
make the reconciliation possible.
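
To see the conflict concretely, here is a small self-contained C# example
using stock .NET facilities rather than ICU itself. It shows three
mutually inconsistent answers to "how long is this string?", which is
precisely the kind of disagreement ICU's iterators and break iterators
exist to reconcile:

using System;
using System.Globalization;

static class IndexingDemo
{
    static void Main()
    {
        // "e" + combining acute accent, then an emoji outside the BMP.
        string s = "e\u0301\U0001F600";

        int codeUnits = s.Length;  // UTF-16 code units: 4
        int codePoints = 0;        // Unicode code points: 3
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
            codePoints++;
        // User-perceived characters (text elements / grapheme clusters): 2
        int textElements = new StringInfo(s).LengthInTextElements;

        Console.WriteLine("code units={0}, code points={1}, text elements={2}",
                          codeUnits, codePoints, textElements);
    }
}

Any s[i] you expose has to pick one of those index spaces, and whichever
one you pick will be wrong for somebody.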


> It would be nice if immutable shallow types were interned in the special
> heap that the mark phase doesn't scan, like strings, but I doubt that's
> possible. Also, the above is not possible in safe C# (because of the
> fixed array).
>

The mark phase *never* scans strings, so I don't know what you mean here.