Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Thu, 14 Oct 2010 19:10:52 -0700

 

>In further practice, the number of strands tends to be small, so the
>difference between O(log n) and O(1) is negligible.




I'm not sure this is true. For example, in all languages “<”, “>”, the
period, and digits are ASCII. In Chinese, a Y-M-D date mixes Chinese
characters and ASCII numerals; in fact, in nearly all languages you have
UCS-2 codes interspersed with ASCII numbers and punctuation. So you would
need a more complex encoding in which ASCII sequences shorter than some
length n stay in the higher encoding form rather than becoming their own
strand. This is also good because short strings would not need a tree and
hence incur no cost.
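
A minimal sketch of that idea, in Python purely for illustration. The names
split_strands, StrandString and min_run are hypothetical and not part of any
BitC proposal; Python str objects stand in for the raw strand payloads, and
anything non-ASCII is simply treated as "2-byte", which ignores code points
outside the BMP. A sorted list plus binary search stands in for the tree.

```python
from bisect import bisect_right


def split_strands(text, min_run=4):
    """Split text into (width, substring) strands.

    width is 1 for ASCII payloads and 2 for everything else (assumed UCS-2).
    ASCII runs shorter than min_run are absorbed into the neighbouring
    2-byte strand instead of becoming strands of their own.
    """
    # Pass 1: maximal runs of characters with the same natural width.
    runs = []
    for ch in text:
        width = 1 if ord(ch) < 0x80 else 2
        if runs and runs[-1][0] == width:
            runs[-1][1].append(ch)
        else:
            runs.append([width, [ch]])

    # Pass 2: widen short ASCII runs (runs alternate in width, so any
    # neighbour is a 2-byte run) and merge adjacent equal-width runs.
    strands = []
    for width, chars in runs:
        if width == 1 and len(chars) < min_run and len(runs) > 1:
            width = 2
        piece = "".join(chars)
        if strands and strands[-1][0] == width:
            strands[-1] = (width, strands[-1][1] + piece)
        else:
            strands.append((width, piece))
    return strands


class StrandString:
    """Character-indexed string backed by strands of differing width."""

    def __init__(self, text, min_run=4):
        self.strands = split_strands(text, min_run)
        self.starts = []          # first character index of each strand
        pos = 0
        for _, piece in self.strands:
            self.starts.append(pos)
            pos += len(piece)
        self.length = pos

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # O(log n) in the number of strands, O(1) inside a strand.
        k = bisect_right(self.starts, index) - 1
        return self.strands[k][1][index - self.starts[k]]
```

With min_run=4, the date "2010年10月14日" becomes two strands, "2010" as
ASCII and "年10月14日" as 2-byte, rather than six, and a pure-ASCII string
stays a single strand with no search cost at all.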

 

It is an interesting option that deserves more thought; I'm not sold on byte
indexes with operator overloading either. It also has the advantage of
making line indexes trivial to introduce.

 

The main cons I see, besides the tree index/reference cost, are that each
substring would need a field (which may be aligned to 4-8 bytes) or a char
to indicate the encoding, and that the initial/final parse overhead is
higher.

 

Another big one is appending a string of one-byte UTF-8 characters to a
string of 2-byte chars: such operations would require a conversion each
time. This would be common in foreign languages, e.g. in HTML and XML
parsing (though splitting would be cheap, as it would often occur along
natural lines).

 

<div> 

关于支付宝

</div>
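
To make the conversion cost concrete, here is a rough sketch in Python using
only the standard codecs. The helper name widen is hypothetical, UTF-16LE
stands in for UCS-2 (the two agree for these BMP characters), and the
whitespace of the snippet above is left out.

```python
def widen(ascii_bytes):
    """Convert a 1-byte-per-char ASCII buffer to 2-byte UTF-16LE code units."""
    return ascii_bytes.decode("ascii").encode("utf-16-le")


tag_open = "<div>".encode("ascii")        # 5 bytes, 1 byte per char
body = "关于支付宝".encode("utf-16-le")    # 10 bytes, 2 bytes per char
tag_close = "</div>".encode("ascii")      # 6 bytes, 1 byte per char

# Flat 2-byte string: every ASCII piece has to be widened on append.
flat = widen(tag_open) + body + widen(tag_close)

# Strand form: the pieces are kept as-is, each tagged with its width,
# so appending copies nothing and splitting falls on strand boundaries.
strands = [(1, tag_open), (2, body), (1, tag_close)]
strand_payload = sum(len(data) for _, data in strands)

print(len(flat), strand_payload)          # 32 21
```

The flat form pays a conversion for both tags and ends up at 32 bytes; the
strand form keeps 21 bytes of payload but then carries the per-strand
encoding field mentioned earlier.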

 

My gut feel says this method is a bit too heavy unless byte indexes have too
many issues. I think it is superior to C#, Java, and likely to schemes that
use UTF-8 with char/code point indexes.

 

>>Lastly, is it a good idea to support multiple underlying schemes, aside
>>from legacy support methods like ToFixedCharArray()? Java and .NET have
>>survived without it, and having a single scheme helps interop. E.g. a
>>bytecode file (.NET assembly or Windows DLL) will work on any machine,
>>but with different possible internal storage schemes this would not be
>>possible.

>I think that's wrong. Reading a string from a bytecode file qualifies as
>serialization. All that is required is a normative byte code file format,
>and that's got nothing to do with the internal string representation.

 

I was referring to the standard-lib-agnostic issue William mentioned, which
I'm not sure you are even pursuing. E.g. person A builds BitC with a UCS-2
standard lib and person B builds it with UTF-8; DLLs/libs/assemblies from
the two builds, dropped on the same machine, will then not work together.
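
A small sketch of the distinction, in Python for illustration only. It
assumes, hypothetically, that the normative file format stores strings as
UTF-8; the loader names are made up, and UTF-16LE again stands in for UCS-2.

```python
def load_as_ucs2(serialized):
    """Build A: decode the file format into 2-byte code units in memory."""
    return serialized.decode("utf-8").encode("utf-16-le")


def load_as_utf8(serialized):
    """Build B: keep UTF-8 as the in-memory representation."""
    return bytes(serialized)


on_disk = "关于".encode("utf-8")   # the same serialized bytes for both builds

in_memory_a = load_as_ucs2(on_disk)
in_memory_b = load_as_utf8(on_disk)

# Both builds read the same file and agree on the text it contains...
assert in_memory_a.decode("utf-16-le") == in_memory_b.decode("utf-8")

# ...but their in-memory buffers differ, so code from one build cannot
# take a raw string buffer from the other without converting it.
assert in_memory_a != in_memory_b
```

So a normative file format covers loading, while the case above concerns
compiled libraries from the two builds passing strings to each other
directly in memory.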

 

Ben

