So I'm looking at string encoding issues again, and concluding that it's
just as icky as it was the last time I looked. I've looked at Python, and I
do think they did right by declaring that I/O happens in units of bytes with
conversion occurring at a layer above the I/O layer. Separately, I've
concluded (reluctantly) that we really do need constant-time string
indexing, and that I've been a dolt about that.


Unfortunately, for languages like Chinese, which introduce new characters
that may be common, you can only have constant-time string indexing with
UCS-32.
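To make the indexing point concrete, here is a small sketch (Python used purely for illustration, since the thread predates any BitC code): in UTF-8 a character may occupy one to four bytes, so reaching character i requires a scan, while in UTF-32 every character is exactly four bytes and character i sits at byte offset 4*i.

```python
# Sketch: why a fixed-width encoding is needed for O(1) indexing.
s = "a\u00e9\u4e2d"            # 'a', 'e-acute', and a common Chinese character

utf32 = s.encode("utf-32-be")  # big-endian form avoids the 4-byte BOM prefix

# UTF-8: per-character widths differ (1, 2, and 3 bytes here),
# so byte offset of character i depends on everything before it.
assert [len(c.encode("utf-8")) for c in s] == [1, 2, 3]

# UTF-32: every character is 4 bytes, so indexing is a multiply.
assert len(utf32) == 4 * len(s)
third_char = utf32[4 * 2 : 4 * 3].decode("utf-32-be")
assert third_char == "\u4e2d"
```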


Aside from human convenience of naming, I do not think that we need to
introduce a 'bytes' type in the way that Python did. I think byte[] (that
is: byte vector) is sufficient for this purpose.

But that leaves us with the unpleasant question of UCS32 vs. UCS16 as the
normative BitC string representation. 


I don't see the reasoning for going UCS16 at all (except for conversions
to .NET). Originally the idea was that you could represent all characters
in UCS16, but this is no longer true; once you need any variable-length
encoding, you may as well go ASCII. IMHO this has all failed, and we would
have been better off sticking with ASCII but introducing a uniform
extension. It's no coincidence that UTF-8 is growing so fast and dominating
where space/performance are important, e.g. HTML and XML. The fact that
most documents, even Chinese ones, are riddled with ASCII means UTF-8 with
extensions is still far more efficient than UTF-16 or UTF-32. UCS-2 is
dead, and the people who used it moved to the flexible UTF-16, because
converting UCS-2 to UTF-8 is harder and not backward compatible.
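The space claim is easy to check. A quick sketch (the sample markup below is made up, but representative of ASCII-riddled HTML/XML with some Chinese content):

```python
# Sketch: byte cost of ASCII-heavy markup under the three encodings.
doc = '<p class="note">\u4f60\u597d</p>' * 100   # mostly ASCII, some CJK

u8 = len(doc.encode("utf-8"))
u16 = len(doc.encode("utf-16-le"))
u32 = len(doc.encode("utf-32-le"))

# UTF-8 wins as long as ASCII dominates: ASCII costs 1 byte there
# versus 2 in UTF-16 and 4 in UTF-32, while the CJK characters cost
# 3 bytes instead of 2 or 4.
assert u8 < u16 < u32
```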

 

So, back to constant-time indexing:

ASCII / UTF-8: no (constant-time indexing only for pure-ASCII text)

UTF-16: no (one code unit per character for nearly all Western European
text, but not for all Asian text)

UCS-2: yes, but it is dead and cannot represent some characters. (UCS-16
doesn't exist, so I'm assuming you mean UTF-16.)

UTF-32: yes, but poor performance and excessive storage.
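The "UCS-2 cannot represent some characters" point in the list above can be demonstrated directly (Python sketch, for illustration): characters outside the Basic Multilingual Plane have no UCS-2 form at all, and in UTF-16 they take a surrogate pair, which is exactly what makes UTF-16 variable-length.

```python
# Sketch: U+1D11E (musical G clef) lies outside the BMP, so UCS-2
# cannot encode it; UTF-16 spends two 16-bit code units on it.
clef = "\U0001D11E"

units = len(clef.encode("utf-16-le")) // 2   # 16-bit code units used
assert units == 2                            # one character, two units

# A BMP character needs only one unit, so UTF-16 is variable-length.
assert len("\u4e2d".encode("utf-16-le")) // 2 == 1
```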

 

It's a tough choice, which is why constant-time indexing gave way last
time. What is your objection to this, anyway? Wouldn't constant time most
of the time be good enough?


While I don't like the space consumption, I think that UCS32 is the right
answer, because it is the most flexible of the available encodings. The
principal disadvantage is space. The only real solutions for applications
that are concerned with this are to (a) decode strings only when needed, or
(b) carry uninterpreted strings around in some more-compact form as
instances of byte[].
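Option (b) might look something like the following sketch (Python stands in for BitC; `CompactString` and its methods are hypothetical names, not anything proposed in the thread): keep the compact UTF-8 bytes around and decode to a full string only on first use.

```python
# Sketch (hypothetical API): carry uninterpreted bytes, decode lazily.
class CompactString:
    def __init__(self, raw: bytes):
        self._raw = raw        # the compact, uninterpreted byte[] form
        self._decoded = None   # filled in lazily

    @classmethod
    def from_text(cls, text: str) -> "CompactString":
        return cls(text.encode("utf-8"))

    def decode(self) -> str:
        # Decode once, on first use, and cache the result.
        if self._decoded is None:
            self._decoded = self._raw.decode("utf-8")
        return self._decoded

s = CompactString.from_text("hello \u4e16\u754c")
assert len(s._raw) == 12     # 6 ASCII bytes + two 3-byte CJK characters
assert s.decode() == "hello \u4e16\u754c"
```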


But that also means any Tom, Dick, and Harry micro-benchmark, such as a
simple XML parser, will give poor results for BitC. Not good for a new
language trying to be competitive with C.


The problem at that point is that we really *do* want the option to target
environments like CLI and JVM, and neither of these uses UCS32 as their
native string encoding. Inter-converting representations "by magic" is
certainly not a good idea, and I want to avoid a proliferation of string
types corresponding to each encoding.


CLI is a bit weird, due to the OS changing from UCS-2 to variable-length
UTF-16 when everyone realized UCS-2 is not good enough. CLI uses UTF-16,
which is not fixed-length for some languages; e.g. in Chinese, two code
units in the string may represent one character. That means string.Length
is not the letter count, and an index may not correspond to an actual
letter, but it works for Western European character sets.
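The string.Length behaviour described above can be reproduced by counting UTF-16 code units (a Python sketch, since CLI's .Length counts code units, not characters):

```python
# Sketch: one character outside the BMP counts as two UTF-16 code
# units, so a code-unit count is not a letter count.
text = "ab\U00020B8B"          # 'a', 'b', and one rare CJK character

code_units = len(text.encode("utf-16-le")) // 2   # what .Length counts
letters = len(text)                               # actual characters

assert code_units == 4
assert letters == 3
```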


One approach would be to introduce an opaque reference type NativeString,
and a set of runtime operations that will produce NativeString from String
(and the other way as well), and possibly NativeString from byte[]. The
reason to make NativeString strictly opaque is error-prevention. If we
support indexing operations on NativeString, we invite people to write code
that assumes a particular encoding of NativeString, and that code will run
incorrectly (or worse: appear to run correctly) on other platforms.
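A minimal sketch of what a strictly opaque NativeString could look like (Python for illustration; the class, its methods, and the choice of UTF-16 as the "platform" encoding are all assumptions, not BitC design):

```python
# Sketch (hypothetical design): an opaque NativeString that offers
# conversions but no indexing, so user code cannot come to depend on
# the platform's internal encoding.
class NativeString:
    __slots__ = ("_data",)

    def __init__(self, data: bytes):
        self._data = data      # platform encoding; opaque to users

    @classmethod
    def from_string(cls, s: str) -> "NativeString":
        # Assume the platform encoding is UTF-16 here (as on CLI/JVM).
        return cls(s.encode("utf-16-le"))

    def to_string(self) -> str:
        return self._data.decode("utf-16-le")

    # Deliberately no __getitem__, __len__, or iteration: indexing a
    # NativeString would bake in encoding assumptions that break
    # (or silently appear to work) on other platforms.

n = NativeString.from_string("caf\u00e9")
assert n.to_string() == "caf\u00e9"
assert not hasattr(n, "__getitem__")
```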


I proposed something similar last time, i.e. a fast native string and a
more fully featured tree string, but the issues were:

1) Library issues for two string types

2) You can have a single string type and hide the internal representation
instead.
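Point 2) above can be sketched as a single string type whose representation is chosen per value and never exposed (Python for illustration; this is the rough shape CPython itself later adopted with PEP 393, not anything from the thread):

```python
# Sketch (hypothetical): one string type, hidden adaptive representation.
# ASCII-only values stay 1 byte per character; anything wider is stored
# fixed-width, so indexing is O(1) either way.
class AdaptiveString:
    def __init__(self, text: str):
        if all(ord(c) < 128 for c in text):
            self._width, self._buf = 1, text.encode("ascii")
        else:
            self._width, self._buf = 4, text.encode("utf-32-be")

    def __len__(self):
        return len(self._buf) // self._width

    def __getitem__(self, i):
        # Fixed-width buffer, so this is a multiply and a slice: O(1).
        chunk = self._buf[i * self._width : (i + 1) * self._width]
        return chunk.decode("ascii" if self._width == 1 else "utf-32-be")

a = AdaptiveString("plain")
b = AdaptiveString("\u4e2d\u6587ab")
assert (len(a), a[4]) == (5, "n")
assert (len(b), b[0]) == (4, "\u4e2d")
```

The caller sees one type with one API; only the space cost varies with the content.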

 

Why not leave the internal representation open?

 

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
