On Wed, Oct 13, 2010 at 5:59 PM, Ben Kloosterman <[email protected]> wrote:
> True but there is a lot of code like index of start, index of end, take
> substring. This could be horrible, say, for </body> on a typical 2-3K HTML
> page, and even from </body> would be bad.
>
> Not so. I think the misunderstanding lies here: I assume the O(log n) is
> referring to the fact that in many cases the search is not from the
> start...

Not so. It's the time for an arbitrary indexing operation on an arbitrary
string proceeding *de novo*. The next/previous character operations are O(1)
in all implementations **other than** UTF-8/UTF-16 (I disregard UTF-32,
because in that case we would instead use ucs4[]). The UTF encodings do not
synchronize backwards in any sensible fashion.

The implementation I have in mind is as follows. Some of the low-level
details are being made up as I go along.

1. Code points that can be encoded in a single UTF-8 byte are represented
   as bytes. Code points that cannot be encoded in a single UTF-8 byte but
   can be encoded in a single UTF-16 unit are encoded as uint16. All others
   are encoded as uint32. This choice, by the way, is not innocent; it gives
   us leave to decide when a run is too short to justify switching
   encodings.

2. We break a string into a sequence of "strands" such that all elements of
   a given strand have like encodings. The string is then represented as a
   balanced tree of strands keyed by their starting positions.

3. The division of a string into strands is performed *once*, at
   [de]serialization time, or when literals are handled by the compiler.

Such a string has O(log n) random access time, where 'n' is the number of
strands, and O(1) next/prev time. In practice, most strings will have a
homogeneous representation, in which case random access time is O(1). In
further practice, the number of strands tends to be small, so the
difference between O(log n) and O(1) is negligible.
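For concreteness, here is a minimal sketch of the strand idea in Python
(Python rather than BitC purely for brevity; the names `StrandString`,
`_width`, and `cursor_next` are my own invention, not anything from the
BitC design). Strand boundaries are computed once at construction time;
random access bisects on strand start offsets, so it costs O(log n) in the
number of strands; a (strand, offset) cursor gives O(1) next-character
steps.

```python
from bisect import bisect_right

def _width(cp):
    # Smallest fixed-width unit holding this code point: 1 byte below
    # U+0080, 2 bytes for anything encodable as a single UTF-16 unit
    # (below U+10000), 4 bytes otherwise.
    if cp < 0x80:
        return 1
    if cp < 0x10000:
        return 2
    return 4

class StrandString:
    def __init__(self, text):
        # Break the text into maximal runs ("strands") of like width,
        # once, at construction time.
        self.strands = []   # list of (start_index, [code points])
        self.starts = []    # parallel start indices, for bisection
        run, run_start = [], 0
        for i, ch in enumerate(text):
            if run and _width(ord(ch)) != _width(ord(run[0])):
                self.strands.append((run_start, run))
                self.starts.append(run_start)
                run, run_start = [], i
            run.append(ch)
        if run:
            self.strands.append((run_start, run))
            self.starts.append(run_start)

    def __getitem__(self, i):
        # Random access: O(log n) in the number of strands.
        k = bisect_right(self.starts, i) - 1
        start, run = self.strands[k]
        return run[i - start]

    def cursor_next(self, k, off):
        # O(1) next-character step on a (strand, offset) cursor.
        start, run = self.strands[k]
        ch = run[off]
        off += 1
        if off == len(run):    # roll over into the next strand
            k, off = k + 1, 0
        return ch, (k, off)

s = StrandString("abc\u00e9\u00e8def\U0001F600x")
assert s[3] == "\u00e9"          # lands in the 2-byte strand
assert s[8] == "\U0001F600"      # lands in the 4-byte strand
assert len(s.strands) == 5       # "abc" / "éè" / "def" / emoji / "x"
```

A real implementation would of course store each strand as a packed
uint8/uint16/uint32 array and keep the strands in a balanced tree rather
than a flat list, but the asymptotics shown are the ones claimed above.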
The string representation advocated has the further advantage that it plays
nicely with concurrent incremental copying GC, where a size-bounded unit of
non-interruptible copy is required.

> Lastly, is it a good idea supporting multiple underlying schemes aside
> from legacy support methods like ToFixedCharArray()? Java and .NET have
> survived without it, and having single schemes helps interop. E.g. a byte
> code file (.NET assembly or Windows DLL) will work on any machine, but
> with different possible internal storage schemes this would not be
> possible.

I think that's wrong. Reading a string from a bytecode file qualifies as
serialization. All that is required is a normative byte code file format,
and that's got nothing to do with the internal string representation.

> If you're saying we leave it up to the lib, that doesn't really change
> things, as it just moves the discussion to that point.

Actually, it does, and I'll come back to that issue in my next note.

> Previously the string document made a strong statement that BitC would
> use UCS-2 or UCS-4, and hence fixed-width chars; by adding UCS-1, that
> part of the document doesn't say much anymore... e.g. the runtime will
> use Unicode and may use byte or char indexing.

That's because I had it wrong, and I haven't had a chance to update it. But
more in a moment.

> Hope I'm not railroading the development by questioning this...

You aren't. In fact, it's good to hash this out to test ideas on all sides
before we carve anything in stone.

shap
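To illustrate the serialization point with a sketch (in Python for brevity;
the length-prefixed layout and the helper names `write_literal` and
`read_literal` are assumptions of mine, not the actual BitC file format):
the normative on-disk form of a string literal can simply be a
length-prefixed UTF-8 byte sequence, and the loader is then free to lay the
string out in memory however it likes, as one strand or many, without the
file format ever mentioning the internal representation.

```python
import struct

def write_literal(text):
    # Normative wire form: 4-byte little-endian length, then UTF-8 bytes.
    data = text.encode("utf-8")
    return struct.pack("<I", len(data)) + data

def read_literal(blob):
    # Decode one literal and return it plus the unconsumed remainder.
    # Splitting into strands (or any other internal layout) would happen
    # here, at deserialization time, invisibly to the file format.
    (n,) = struct.unpack_from("<I", blob)
    text = blob[4:4 + n].decode("utf-8")
    return text, blob[4 + n:]

blob = write_literal("caf\u00e9") + write_literal("\U0001F600")
a, blob = read_literal(blob)
b, _ = read_literal(blob)
assert (a, b) == ("caf\u00e9", "\U0001F600")
```

Two implementations with entirely different internal string schemes
interoperate so long as both honor this wire form, which is the substance
of the reply above.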
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
