Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Sat, 16 Oct 2010 01:22:42 -0700

1)      When appending strings, isn't it an issue if you add, say, one UCS-1
char, then one UCS-4 char, then one UCS-1 char, because of the tree
overhead? Would you then need to parse and re-encode? Do you build strings
from chars as UCS-4 in a string builder, or will the string builder switch
to UCS-4 only when needed?


2)      Let me confirm this: each strand is a separate heap object, right?

 

Anyway, re the systems discussed/proposed:

 

*   UCS-2 based systems: flawed because they cannot encode all chars.

*   UCS-4: too heavy on memory.

*   UTF-8 with char indexes: O(n) scans, so not suitable for medium to long
    strings. Risk: performance on medium to long strings.

*   UTF-8 with binary index: O(1) lookups. Issues with backward-compatible
    APIs/algorithms, and developers need to understand that the index counts
    bytes. Not great for long strings because of the large memory object.
    Risk: backward compatibility and incorrect usage by developers.

    o   Use operator overloading on +. This will require a typedef of the
        index type and could remove some of the compatibility issues.

    o   Value type option. Will still require a reference-type string, e.g.
        longstring, so it can only be considered if we do that. Useful in
        cases where you don't have/want a heap.

*   Strand use: benefits for large strings and GC. Some concern over
    small-string overhead and how it will work in practice, e.g. aggregating
    mixed strings and how to create strings efficiently. Risk: small-string
    performance and possibly string creation.

*   String and long string: risk on the runtime/lib from supporting 2
    string types. Has the real advantage of competition: if one has issues,
    the superior type will win.
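
To make the difference between the two UTF-8 indexing options concrete, here
is a rough sketch in Python (for illustration only; the function names are
made up and are not BitC library calls). A char index has to be turned into
a byte offset by scanning, while a byte index is used directly but is only
valid on a char boundary:

```python
def char_index_to_byte_offset(buf: bytes, char_index: int) -> int:
    """Char-indexed lookup: O(n) scan that counts non-continuation bytes."""
    seen = 0
    for offset, b in enumerate(buf):
        if (b & 0xC0) != 0x80:          # not a 10xxxxxx continuation byte
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError(char_index)


def char_at_byte_offset(buf: bytes, offset: int) -> str:
    """Byte-indexed lookup: O(1), but the caller must pass a char boundary."""
    lead = buf[offset]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("offset is not at a char boundary")
    return buf[offset:offset + length].decode("utf-8")


buf = "naïve 𝄞 clef".encode("utf-8")
offset = char_index_to_byte_offset(buf, 6)   # scan to the 7th char
clef = char_at_byte_offset(buf, offset)      # direct lookup by byte offset
```

Passing an offset that lands in the middle of a multi-byte char raises here,
which is the "developers need to understand bytes" risk in miniature.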

 

 

Ben

 

From: [email protected] [mailto:[email protected]] On
Behalf Of Jonathan S. Shapiro
Sent: Friday, October 15, 2010 3:03 PM
To: [email protected]; Discussions about the BitC language
Subject: Re: [bitc-dev] Unicode and bitc

 

2010/10/14 Ben Kloosterman <[email protected]>

The main cons I see, besides the tree index/reference cost, are that each
substring would need a field (which may be aligned to 4-8 bytes) or char to
indicate the encoding, and the higher initial/final parse overhead.

 

Yes. That field is two bits and can be encoded in the low-order two bits of
the relevant array reference.
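
A minimal sketch of that kind of tagging, simulated in Python with plain
integers standing in for references. It assumes references are at least
4-byte aligned so the low two bits are free; the tag values and function
names are hypothetical, not anything from the BitC runtime:

```python
# Hypothetical encoding tags; the actual assignment is not specified here.
ENC_UCS1, ENC_UCS2, ENC_UCS4 = 0, 1, 2
TAG_MASK = 0b11


def tag_ref(addr: int, enc: int) -> int:
    """Pack a 2-bit encoding tag into the low bits of an aligned address."""
    if addr & TAG_MASK:
        raise ValueError("address must be 4-byte aligned")
    if not 0 <= enc <= TAG_MASK:
        raise ValueError("tag must fit in two bits")
    return addr | enc


def untag_ref(tagged: int) -> tuple:
    """Split a tagged reference back into (address, encoding tag)."""
    return tagged & ~TAG_MASK, tagged & TAG_MASK


tagged = tag_ref(0x1000, ENC_UCS2)
addr, enc = untag_ref(tagged)
```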

 

Another biggy is adding a string of one-byte UTF-8 chars to a string of
two-byte chars; such operations would require a conversion each time. And
this would be common in foreign languages, e.g. HTML and XML parsing (though
splitting would be cheap, as it would often occur along natural lines).

 

Nope. The strands are deep-constant. Appending a string of UTF-8 bytes to a
string of UTF-16 bytes is merely a matter of appending metadata. The content
runs don't change at all.
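
As a rough illustration of appending without conversion, here is a sketch in
Python. It uses a flat tuple of strands rather than the tree discussed in
this thread, and the class and function names are assumptions made up for
the example:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Strand:
    encoding: str   # e.g. "utf-8" or "utf-16-le"
    data: bytes     # immutable content run, never re-encoded


@dataclass(frozen=True)
class StrandString:
    strands: Tuple[Strand, ...] = ()

    def append(self, other: "StrandString") -> "StrandString":
        # Only the metadata (the strand list) is rebuilt; the byte runs
        # of both operands are shared with the result, not copied.
        return StrandString(self.strands + other.strands)

    def decode(self) -> str:
        return "".join(s.data.decode(s.encoding) for s in self.strands)


def from_text(text: str, encoding: str) -> StrandString:
    return StrandString((Strand(encoding, text.encode(encoding)),))


a = from_text("héllo ", "utf-16-le")
b = from_text("world", "utf-8")
c = a.append(b)
```

After the append, `c` holds two strands with different encodings, and each
strand's `data` is the same object that `a` and `b` already held.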

 

I was referring to the standard-lib-agnostic issue William mentioned, which
I'm not sure you are even pursuing: e.g. person A builds BitC with a UCS-2
standard lib and person B builds it with UTF-8; then dropping such
DLLs/libs/assemblies on the same machine will not work together.

 

I don't see a problem there, so long as the specification for
[de]serialization is sufficient.

 

shap


_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
