1) For adding strings, isn't it an issue if you add, say, 1 UCS-1, then 1 UCS-4, then 1 UCS-1, because of the tree overhead? Would you then need to parse and re-encode? Do you create strings from chars as UCS-4 in a string builder, or will the string builder change to UCS-4 only when needed?
2) Let me confirm this: each strand is a separate heap object, right?

Anyway, re the systems discussed/proposed:

- UCS-2 based systems: flawed because they cannot encode all chars.
- UCS-4: too heavy on memory.
- UTF-8 with char indexes: O(n) scans, so not suitable for medium to long strings. Risk: performance on medium to long strings.
- UTF-8 with binary index: O(1) lookups. Issues with backward-compatible APIs/algorithms and with developers needing to understand its bytes. Not great for long strings, since each string is one large memory object. Risk: backward compatibility and incorrect usage by developers.
    - Using operator overloading on + will require a typedef of the index type and could remove some of the compatibility issues.
    - Value type option: will still require a reference-type string, e.g. longstring, so it can only be considered if we do that. Useful in cases where you don't have/want a heap.
- Strand use: benefits for large strings and GC; some concern over small-string overhead and how it will work in practice, e.g. aggregating mixed strings and how to create strings efficiently. Risk: small-string performance and possibly string creation.
- String and long string: Risk: the runtime/lib has to support 2 string types. Has the real advantage of competition: if one has issues, the superior type will win.

Ben

From: [email protected] [mailto:[email protected]] On Behalf Of Jonathan S. Shapiro
Sent: Friday, October 15, 2010 3:03 PM
To: [email protected]; Discussions about the BitC language
Subject: Re: [bitc-dev] Unicode and bitc

2010/10/14 Ben Kloosterman <[email protected]>

> The main cons I see, besides the tree index/reference cost, are that each substring would need a field (which may be aligned to 4-8 bytes) or char to indicate the encoding, and the higher initial/final parse overhead.

Yes. That field is two bits and can be encoded in the low-order two bits of the relevant array reference.
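As a rough illustration of the two-bit tag idea above: a reference that is aligned to at least 4 bytes always has zero in its two low-order bits, so those bits are free to carry an encoding tag. The sketch below is only an assumption about how that could look. Python has no raw pointers, so a plain integer stands in for the address, and the names `pack`, `unpack`, and the tag values are hypothetical, not anything from BitC.

```python
# Hypothetical sketch: carry a 2-bit encoding tag in the low-order bits
# of an aligned reference. A plain int stands in for the address.

TAG_BITS = 2
TAG_MASK = (1 << TAG_BITS) - 1  # 0b11

# Assumed tag assignment; the thread does not say which value means what.
UCS1, UCS2, UCS4 = 0, 1, 2


def pack(addr, tag):
    """Combine an aligned address and a tag into one tagged reference."""
    if addr & TAG_MASK:
        raise ValueError("address must be aligned to 4 bytes")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in two bits")
    return addr | tag


def unpack(ref):
    """Split a tagged reference back into (address, tag)."""
    return ref & ~TAG_MASK, ref & TAG_MASK


ref = pack(0x1000, UCS4)
addr, tag = unpack(ref)
print(hex(ref), hex(addr), tag)  # 0x1002 0x1000 2
```

The point of the sketch is that the tag costs no extra field per strand; the price is a mask operation before every dereference.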
> Another biggy is adding a string of UTF-8 one-byte chars to a string of 2-byte chars; such operations would require a conversion each time. And this would be common in foreign languages, e.g. HTML and XML parsing (though splitting would be cheap, as it would often occur along natural lines).

Nope. The strands are deep-constant. Appending a string of UTF-8 bytes to a string of UTF-16 bytes is merely a matter of appending metadata. The content runs don't change at all.

> I was referring to the standard-lib-agnostic issue William mentioned, which I'm not sure you are even pursuing: e.g. person A builds BitC with a UCS-2 standard lib and person B builds it with UTF-8; then such DLLs/libs/assemblies dropped on the same machine will not work together.

I don't see a problem there, so long as the specification for [de]serialization is sufficient.

shap
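To make the "appending metadata" point concrete, here is a minimal sketch of a strand-based string in which concatenation never re-encodes anything. It is an assumption-laden toy, not the BitC design: the names `Strand`, `StrandString`, and `from_text` are hypothetical, a flat tuple stands in for the tree, and the encoding is stored as a codec name rather than as a two-bit tag.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Strand:
    """One immutable content run plus the metadata saying how it is encoded."""
    encoding: str  # a Python codec name, e.g. "utf-8" or "utf-16-le"
    data: bytes    # the content run; never modified after creation


@dataclass(frozen=True)
class StrandString:
    """A string as a sequence of strands (flat tuple instead of a tree)."""
    strands: Tuple[Strand, ...] = ()

    def __add__(self, other):
        # Only the list of strands grows; no content run is copied or converted.
        return StrandString(self.strands + other.strands)

    def __str__(self):
        # Decoding happens only when the text is actually needed.
        return "".join(s.data.decode(s.encoding) for s in self.strands)


def from_text(text, encoding):
    """Build a one-strand string stored in the given encoding."""
    return StrandString((Strand(encoding, text.encode(encoding)),))


a = from_text("<p>", "utf-8")
b = from_text("日本語", "utf-16-le")
c = a + b + from_text("</p>", "utf-8")

print(str(c))                         # <p>日本語</p>
print(len(c.strands))                 # 3
print(c.strands[0] is a.strands[0])   # True: the original run is shared
```

The sketch also shows the cost raised in the questions at the top of this message: mixing encodings one small piece at a time produces one strand per piece, so small-string overhead depends on how aggressively adjacent strands are merged.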
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
