Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Fri, 15 Oct 2010 06:02:11 -0700

>>
 >> :-)
 >> I think the memory overhead is too bad in business application and DB
 >> land... but it is a good candidate for the worked on fixedchararrays. Note
 >> I'm not viewing it from the point of view of Unix apps reading the data and
 >> then finishing but more from memory hungry Business apps, App servers, Web
 >> servers, SOA servers, DBs etc where the strings stay in memory for a long
 >> time while not used.
 >>
 >
 >I think this is in most cases caused by poor design which in turn is
 >caused by the relative expenses of buying more hardware and improving
 >the design.


Good design is expensive... runtimes correctly optimize for developer time.
Witness the use of string indexers everywhere, which require lots of string
compares.
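
As a rough illustration of the string indexer point, here is a minimal sketch.
The `Row` class, its fields and the `compares` counter are made up for this
example and are not any real API; a real implementation would likely hash the
name, which still touches every char of the key at least once.

```python
class Row:
    """Hypothetical row type: column lookup by name vs. by ordinal."""

    def __init__(self, names, values):
        self.names = list(names)
        self.values = list(values)
        self.compares = 0  # counts string comparisons, for illustration only

    def by_name(self, name):
        # Linear scan: one string compare per column until a match is found.
        for i, candidate in enumerate(self.names):
            self.compares += 1
            if candidate == name:
                return self.values[i]
        raise KeyError(name)

    def by_ordinal(self, i):
        # Plain array index, no string compares at all.
        return self.values[i]


row = Row(["id", "name", "city"], [1, "a", "b"])
assert row.by_name("city") == "b"    # scans all three column names
assert row.by_ordinal(2) == "b"      # same cell, zero compares
```

The indexer is the convenient thing to write, which is the point: the runtime
trades machine time for developer time.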

<snip>

>
 >The native encodings for CJK include characters not in Unicode which
 >may or may not be added to Unicode eventually.
 >
 >There are also encodings which include character variants to
 >accurately represent ancient Chinese texts, for example, which are
 >likely not going to be folded into Unicode.
 >
 >So you should be prepared for situations when text in an external
 >encoding cannot be completely converted into the internal encoding.
 >

CJK is a mess, and attempting to fold the common traditional chars used in
Japan, Hong Kong, Taiwan and Korea was a huge mistake that has led to the slow
adoption of Unicode there.  The #1 issue, as has been mentioned, is that the
representation is not 1:1.

The ancient Chinese forms, including bone script, are now in Unicode.  What you
do need to be prepared for is character set changes...

All of the 70,000 Simplified chars are in Unicode, though that does change
every year.

 >> So you're okay with reducing the D-cache and D-TLB performance on
 >> large-scale programs, and therefore their overall performance, by a factor
 >> of >4? That seems a bit over-purist to me.
 >>
 >> So first, I think this is the wrong way to prioritize as a matter of
 >> defaults, but second, I think I've already made it clear that no either/or
 >> choice is actually required. The "stranded string" approach does all of what
 >> you want and more. The O(log n) factor issue is more than compensated for by
 >> the improvement in D-cache and D-TLB utilization.
 >>
 >
 >By calling for a more complex implementation you are reducing the
 >I-cache and I-TLB (which may or may not be separate from data).
 >
 >I can't say which is more important in which situation without tedious
 >analysis of running actual programs on actual hardware, though.

I don't think the strand representation will use a lot of code (though it does
need a lot of thought and tuning).
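
To give a feel for the code size, here is a minimal sketch of one possible
reading of the strand idea. This is my assumption, not the actual BitC design:
each strand stores its chars at the narrowest fixed width (1, 2 or 4 bytes)
that fits, and an index of starting offsets gives O(log n) lookup by binary
search. `StrandedString` and `width_for` are names invented for the example.

```python
import bisect


def width_for(text):
    # Narrowest fixed width (bytes per code point) that fits every char.
    top = max(map(ord, text), default=0)
    return 1 if top < 0x100 else 2 if top < 0x10000 else 4


class StrandedString:
    def __init__(self, strands):
        self.strands = []  # (width, bytes) pairs, one per strand
        self.starts = []   # char offset at which each strand begins
        self.length = 0
        for text in strands:
            if not text:
                continue   # skip empty strands so starts stays strictly increasing
            w = width_for(text)
            data = b"".join(ord(c).to_bytes(w, "little") for c in text)
            self.strands.append((w, data))
            self.starts.append(self.length)
            self.length += len(text)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search for the strand containing char i: the O(log n) step.
        s = bisect.bisect_right(self.starts, i) - 1
        w, data = self.strands[s]
        off = (i - self.starts[s]) * w
        return chr(int.from_bytes(data[off:off + w], "little"))

    def storage_bytes(self):
        # Payload only; ignores per-strand and per-object overhead.
        return sum(len(data) for _, data in self.strands)


s = StrandedString(["hello ", "世界", "!"])
assert s[6] == "世"
assert s.storage_bytes() == 11  # 6*1 + 2*2 + 1*1, vs. 36 at 4 bytes per char
```

How to pick strand boundaries, and when to merge or split strands, is where the
thought and tuning goes; the sketch just takes the boundaries as given.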

 >>
 >> 2. The problem we are trying to solve ( GC , O(N) )  apply only to large
 >> strings so why pay the price for frequently used small strings. A horses for
 >> courses approach may fit better and the big string can solve a number of
 >> other problems.
 >
 >Since *any* reference takes 8 bytes (the size of a pointer) I don't
 >see an empty string taking 8 bytes as an issue.

An empty string won't be 8 bytes: you are looking at the reference to the
string + the object overhead + 2 internal null pointers, for an increase of
16 bytes.

Even worse, when not using nullable references you would have a string array
initialized to empty strings, which could be nasty for a multi-dimensional
array (e.g. data tables, SQL readers etc.).  Anyway, empty strings by
themselves are not a huge issue, but small strings in general are, especially
as they are very frequent.
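
A back-of-the-envelope version of that arithmetic, as a sketch. The 8-byte
pointer matches the 64-bit case discussed above; the 16-byte object header is
purely my assumption (it varies by runtime), and the sketch assumes every cell
owns its own string object rather than sharing a single empty string.

```python
POINTER = 8        # bytes per reference, assuming a 64-bit target
HEADER = 16        # ASSUMED object header size; varies by runtime
STRAND_FIELDS = 2  # the two internal pointers a stranded string carries


def per_string_bytes(payload, stranded):
    # Reference held by the owner + object header + char payload,
    # plus the two internal pointers when the string is stranded.
    extra = STRAND_FIELDS * POINTER if stranded else 0
    return POINTER + HEADER + payload + extra


def table_bytes(rows, cols, payload, stranded):
    # Every cell holds its own string object in this sketch.
    return rows * cols * per_string_bytes(payload, stranded)


# One empty string: 16 bytes more when stranded.
assert per_string_bytes(0, True) - per_string_bytes(0, False) == 16

# A 1000 x 20 table of empty strings: 320,000 extra bytes.
flat = table_bytes(1000, 20, 0, stranded=False)
stranded = table_bytes(1000, 20, 0, stranded=True)
assert stranded - flat == 320_000
```

The extra 16 bytes is the same for a 4 char string as for an empty one, which
is why the small string case worries me more than the empty one.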

Ben  


