On Wed, Oct 13, 2010 at 5:59 PM, Ben Kloosterman <[email protected]> wrote:
> True but there is a lot of code like index of start, index of end, take
> substring. This could be horrible, say, for </body> on a typical 2-3K HTML
> page, and even from </body> would be bad.
>
> Not so. I think the misunderstanding lies here: I assume the O(log n) is
> referring to the fact that in many cases the search is not from the
> start...

Not so. It's the time for an arbitrary indexing operation on an arbitrary
string proceeding *de novo*. The next/previous character operations are O(1)
in all implementations **other than** UTF-8/UTF-16 (I disregard UTF-32,
because in that case we would instead use ucs4[]). The UTF encodings do not
synchronize backwards in any sensible fashion.

The implementation I have in mind is as follows. Some of the low-level
details are being made up as I go along.

1. Code points that can be encoded in a single UTF-8 byte are represented
   as bytes. Code points that cannot be encoded in a single UTF-8 byte but
   can be encoded in a single UTF-16 unit are encoded as uint16. All others
   are encoded as uint32. This choice, by the way, is not innocent; it gives
   us leave to decide when a run is too short to justify switching
   encodings.

2. We break a string into a sequence of "strands" such that all elements of
   a given strand have like encodings. The string is then represented as a
   balanced tree of strands keyed by their starting positions.

3. The division of a string into strands is performed *once*, at
   [de]serialization time, or when literals are handled by the compiler.

Such a string has O(log n) random access time, where 'n' is the number of
strands, and O(1) next/prev time. In practice, most strings will have a
homogeneous representation, in which case random access time is O(1). In
further practice, the number of strands tends to be small, so the
difference between O(log n) and O(1) is negligible.
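For concreteness, here is a minimal sketch of the strand idea in Python
(Python rather than BitC purely for brevity; the names `StrandString`,
`_width`, and `cursor_next` are my own invention, not anything from the
BitC design). Strand boundaries are computed once at construction time;
random access bisects on strand start offsets, so it costs O(log n) in the
number of strands; a (strand, offset) cursor gives O(1) next-character
steps.

```python
from bisect import bisect_right

def _width(cp):
    # Smallest fixed-width unit holding this code point: 1 byte below
    # U+0080, 2 bytes for anything encodable as a single UTF-16 unit
    # (below U+10000), 4 bytes otherwise.
    if cp < 0x80:
        return 1
    if cp < 0x10000:
        return 2
    return 4

class StrandString:
    def __init__(self, text):
        # Break the text into maximal runs ("strands") of like width,
        # once, at construction time.
        self.strands = []   # list of (start_index, [code points])
        self.starts = []    # parallel start indices, for bisection
        run, run_start = [], 0
        for i, ch in enumerate(text):
            if run and _width(ord(ch)) != _width(ord(run[0])):
                self.strands.append((run_start, run))
                self.starts.append(run_start)
                run, run_start = [], i
            run.append(ch)
        if run:
            self.strands.append((run_start, run))
            self.starts.append(run_start)

    def __getitem__(self, i):
        # Random access: O(log n) in the number of strands.
        k = bisect_right(self.starts, i) - 1
        start, run = self.strands[k]
        return run[i - start]

    def cursor_next(self, k, off):
        # O(1) next-character step on a (strand, offset) cursor.
        start, run = self.strands[k]
        ch = run[off]
        off += 1
        if off == len(run):    # roll over into the next strand
            k, off = k + 1, 0
        return ch, (k, off)

s = StrandString("abc\u00e9\u00e8def\U0001F600x")
assert s[3] == "\u00e9"          # lands in the 2-byte strand
assert s[8] == "\U0001F600"      # lands in the 4-byte strand
assert len(s.strands) == 5       # "abc" / "éè" / "def" / emoji / "x"
```

A real implementation would of course store each strand as a packed
uint8/uint16/uint32 array and keep the strands in a balanced tree rather
than a flat list, but the asymptotics shown are the ones claimed above.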
The string representation advocated has the further advantage that it plays
nicely with concurrent incremental copying GC, where a size-bounded unit of
non-interruptible copy is required.

> Lastly, is it a good idea supporting multiple underlying schemes aside
> from legacy support methods like ToFixedCharArray()? Java and .NET have
> survived without it, and having single schemes helps interop. E.g. a byte
> code file (.NET assembly or Windows DLL) will work on any machine, but
> with different possible internal storage schemes this would not be
> possible.

I think that's wrong. Reading a string from a bytecode file qualifies as
serialization. All that is required is a normative byte code file format,
and that's got nothing to do with the internal string representation.

> If you're saying we leave it up to the lib, that doesn't really change
> things, as it just moves the discussion to that point.

Actually, it does, and I'll come back to that issue in my next note.

> Previously the string document made a strong statement that BitC would
> use UCS-2 or UCS-4, and hence fixed-width chars; by adding UCS-1, that
> part of the document doesn't say much anymore... e.g. the runtime will
> use Unicode and may use byte or char indexing.

That's because I had it wrong, and I haven't had a chance to update it. But
more in a moment.

> Hope I'm not railroading the development by questioning this...

You aren't. In fact, it's good to hash this out to test ideas on all sides
before we carve anything in stone.

shap
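To illustrate the serialization point with a sketch (in Python for brevity;
the length-prefixed layout and the helper names `write_literal` and
`read_literal` are assumptions of mine, not the actual BitC file format):
the normative on-disk form of a string literal can simply be a
length-prefixed UTF-8 byte sequence, and the loader is then free to lay the
string out in memory however it likes, as one strand or many, without the
file format ever mentioning the internal representation.

```python
import struct

def write_literal(text):
    # Normative wire form: 4-byte little-endian length, then UTF-8 bytes.
    data = text.encode("utf-8")
    return struct.pack("<I", len(data)) + data

def read_literal(blob):
    # Decode one literal and return it plus the unconsumed remainder.
    # Splitting into strands (or any other internal layout) would happen
    # here, at deserialization time, invisibly to the file format.
    (n,) = struct.unpack_from("<I", blob)
    text = blob[4:4 + n].decode("utf-8")
    return text, blob[4 + n:]

blob = write_literal("caf\u00e9") + write_literal("\U0001F600")
a, blob = read_literal(blob)
b, _ = read_literal(blob)
assert (a, b) == ("caf\u00e9", "\U0001F600")
```

Two implementations with entirely different internal string schemes
interoperate so long as both honor this wire form, which is the substance
of the reply above.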
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
