At least I was right about .NET after all these years: like Java, it uses a borked system.
>BitC does not provide mutable strings.

That's good.

>The get next/previous operation speed is much more important than finding the initial location.

True, but there is a lot of code of the form: index of start, index of end, take substring. This could be horrible, say, when looking for </body> on a typical 2-3K HTML page, and even searching from </body> would be bad.

I know it's not really O(n). I just used it for want of a better word, to describe developers passing in an index that has to be scanned for from the start in a non-fixed-width char representation. In most operations you can work off the result of the find within the library, but when the user communicates an index to the lib you have to do one of the following:

- Use additional storage for an index (which is bad and complicated).
- Use a complex index type, say a union of char and byte offset (in theory the compiler should make this just as efficient, and it communicates well between programmer and lib). You could overcome legacy issues here by making the index the traditional char/code point rather than a byte offset.
- Cache a single char offset to byte offset lookup for large strings.
- Scan from the start.
- Use byte offsets and run legacy code on a ToFixedCharArray(). I kind of like this, since a lot of C legacy code relies on mutable strings.

I assume the O(log(n)) is referring to the fact that in many cases the search is not from the start.

Lastly, is it a good idea to support multiple underlying schemes, aside from legacy support methods like ToFixedCharArray()? Java and .NET have survived without it, and having a single scheme helps interop. E.g. a byte code file (.NET assembly or Windows DLL) will work on any machine, but with different possible internal storage schemes this would not be possible. If you're saying we leave it up to the lib, that doesn't really change things, as it just moves the discussion to that point.
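To make the "cache a single char offset" and "scan from the start" options concrete, here is a minimal Python sketch. The class name Utf8Text and its methods are hypothetical, invented for illustration only; they are not BitC, Java, or .NET APIs, and a real runtime would likely make different trade-offs.

```python
class Utf8Text:
    """Hypothetical immutable string stored as UTF-8 bytes (illustration only)."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        # Single cached (char_index, byte_offset) pair from the last lookup.
        self._cache = (0, 0)

    def byte_offset(self, char_index: int) -> int:
        """Map a code point index to a byte offset by scanning forward."""
        cached_char, cached_byte = self._cache
        # Resume from the cached position when the target is at or past it;
        # otherwise fall back to scanning from the start.
        if char_index >= cached_char:
            ci, bi = cached_char, cached_byte
        else:
            ci, bi = 0, 0
        data = self.data
        while ci < char_index:
            if bi >= len(data):
                raise IndexError("char index out of range")
            bi += 1
            # Skip continuation bytes (bit pattern 10xxxxxx).
            while bi < len(data) and (data[bi] & 0xC0) == 0x80:
                bi += 1
            ci += 1
        self._cache = (ci, bi)
        return bi

    def substring(self, start: int, end: int) -> str:
        """Slice by code point indices, the legacy-style interface."""
        b_start = self.byte_offset(start)
        # The second lookup resumes from the cached start position.
        b_end = self.byte_offset(end)
        return self.data[b_start:b_end].decode("utf-8")
```

With this shape, the "index of start, index of end, take substring" pattern pays for one scan up to the start index and then only for the distance between the two indices, as long as the indices arrive in increasing order. A lookup that goes backwards still rescans from the beginning.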
Previously the string document made a strong statement that BitC would use UCS-2 or UCS-4, and hence fixed-width chars. By adding UCS-1, that part of the document doesn't say much anymore, e.g. the runtime will use Unicode and may use byte or char indexing.

Hope I'm not railroading the development by questioning this :) I just think the benefits of UTF-8 are strong when not dealing with legacy support.

Ben

>>UCS-2, which offers O(1) indexing and finds but can't represent characters outside the Basic Multilingual Plane (some Asian chars among them), requiring a non-standard encoding on top of the internal string representation, and takes 2 bytes of storage per character.
>>UTF-8 with O(n) indexing, which allows the developer to refer to the character. Note that on x86 you can use a fast SSE2 scan for the 10xxxxxx continuation-byte bit pattern to count characters more quickly.
>>UTF-8 with O(1) byte indexing, with more focus on runtime methods and ToFixedCharArray methods for char indexing.
>
>Good list, but incomplete. Ropes with O(log(n)) indexing work just fine in practice. Also worth noting that indexing is almost never random. The get next/previous operation speed is much more important than finding the initial location.
>
>shap
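The character counting and get next/previous operations discussed in the quoted thread both rest on the same property of UTF-8: continuation bytes always have the top two bits 10. Here is a minimal scalar sketch in Python; the function names are hypothetical, and an SSE2 version would apply the same mask to 16 bytes at a time rather than one.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting non-continuation bytes."""
    # Each code point has exactly one byte that is not of the form 10xxxxxx.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def next_offset(data: bytes, offset: int) -> int:
    """Byte offset of the code point after the one starting at `offset`."""
    offset += 1
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


def prev_offset(data: bytes, offset: int) -> int:
    """Byte offset of the code point before `offset` (requires offset > 0)."""
    offset -= 1
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset
```

Both stepping functions touch at most four bytes for valid UTF-8, so next/previous stays constant time even though finding an arbitrary char index does not.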
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
