Re: [bitc-dev] String encoding, again

Jonathan S. Shapiro Tue, 15 Mar 2011 17:57:35 -0700

On Tue, Mar 15, 2011 at 5:24 PM, Ben Kloosterman <[email protected]> wrote:


>
>
>
>
> So I'm looking at string encoding issues again, and concluding that it's
> just as icky as it was the last time I looked. I've looked at Python, and I
> do think they did right by declaring that I/O happens in units of bytes with
> conversion occurring at a layer above the I/O layer. Separately, I've
> concluded (reluctantly) that we really do need constant-time string
> indexing, and that I've been a dolt about that.
>
>
>
>
>
> Unfortunately for  languages like Chinese which introduce new characters
> that may be common , you can only have constant-time string indexing with
> UCS-32.
>

Agreed.

I don’t see the reasoning for going UCS16 at all  ( except for conversions
> to .NET )
>

Sorry. I keep writing UCS16 when I mean UCS2. I trip over the fact that UCS
and UTF use different size conventions.

.NET uses UTF16 encoding, but implements strings as vectors of UCS2, with
the consequence that an indexing operation can return either part of a code
unit pair. They compound the mess by equating the System.char type to UCS2.
Nothing wrong with a UCS2 type, but IMO it should not have been misnamed
"char".

I'm *still* puzzled by what they were thinking, since the conclusion that
UCS2 wasn't big enough was evident when they made this decision. The current
theory of operation seems to be that all handling of the upper code points
is done with strings (i.e. never with chars).

So back the constant time indexing
>
> ASII/ UTF-8    nope ( constant time indexing only for common languages)
>
> UTF-16- nope ( for nearly all western European but not asian )
>
> UCS-2    yes – but is dead and cant represent some chars.  (UCS-16 doesn’t
> exist so im assuming UTF-16)
>
> UTF-32  yes  but poor performance and excessive storage.
>
>
>
> It’s a tough choice.. which is why last time constant time indexing gave
> way . What is your objection to this anyway wouldn’t constant time most of
> the time be good enough ?
>

The problem is that you don't actually *get* constant time most of the time.
If strings are defined as UTF16-encoded UCS4, then you must assume that a
UCS4 code point may be present, and scan accordingly. Further, you end up
having to declare that a "char" is UCS2, which is pretty clearly busted.
Even if you set a bit in the string header to indicate that the string is
not "contaminated" by asian influences (BIG grin there), you still have to
*test* that bit on every indexing operation that gets compiled.

Now to answer your last question, I think my real concern here is to get
O(1) time iteration through a string, which isn't the same thing as O(1)
random access at all. But that appears to require building some sort of
"string iterator" object whose use is known to the code generator. That cure
seems potentially worse than the disease.


> While I don't like the space consumption, I think that UCS32 is the right
> answer, because it is the most flexible of the available encodings. The
> principle disadvantage is space. The only real solution for applications
> that are concerned with this is to (a) decode strings only when needed, or
> (b) carry uninterpreted strings around in some more-compact form as
> instances of byte[].
>
>
>
> But that also means any tom dick and harry micro benchmark such as a simple
> xml parser will give poor results for BitC.  Not good for a new language
> trying to be competitive with C
>

Your concern about microbenchmarks is, regrettably, something we cannot
ignore. At the same time, the problem is fundamental. We have to make
*some*choice.

If I adopt the "strings are UCS4" position, then the right way to write a
simple xml parser is by using vector (or array) of byte. Which is actually
the right way to do it in any case. If I adopt the "strings are UTF16 on
UCS2" position, then the xml parser should *still* be written using bytes;
all we've done is lose the indexing convenience on strings. If I adopt the
"strings are UTF8 on UCS1" position, then the XML parser *still* has to be
written with byte vectors, because otherwise there is no way to index the
individual code units for decoding purposes.

Am I missing something here?


CLI is a bit weird , due to the OS changing from UCS2 to a non fixed length
> encoded UTF-16 – when everyone realized USC-2 is not good enough .  CLI
> uses  UTF-16 which is not fixed length for some languages ie in Chinese 2
> chars in the string may represent one.  Which means string.Length  is not
> the letter count ,   and an index max not represent the actual  letter  but
> it works for western European sets.
>

Yes. Java, as I understand matters, came to a similar outcome, but managed
not to use UTF16 somehow - I'm not clear on the details of that, and I may
very well be mistaken.


> One approach would be to introduce an opaque reference type NativeString,
> and a set of runtime operations that will produce NativeString from String
> (and the other way as well), and possibly NativeString from byte[]. The
> reason to make NativeString strictly opaque is error-prevention. If we
> support indexing operations on NativeString, we invite people to write code
> that assumes a particular encoding of NativeString, and that code will run
> incorrectly (or worse: *appear* to run correctly) on other platforms.
>
>
>
> I proposed something similar last time ie a fast native string  and a more
> featured tree string  but the issues were
>
> 1)      Library issues for 2 string types
>
> 2)      You can have a single string type and hide the internal
> representation .
>
So I'm now convinced that [2] turns out not to be sensible. [1] remains a
concern, which is why I am suggesting that NativeString should be an *opaque
* type.


> Why not leave the internal representation open ?
>

Because the code generator needs to know how to emit indexing operations,
and because I'm presently getting ready to write a lexer and a string intern
subsystem in BitC and I'ld sorta like to know what the performance is likely
to look like.


shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] String encoding, again

Reply via email to