On Tue, 30 Nov 2010 23:34:11 +0000 (UTC) "Lars T. Kyllingstad" <[email protected]> wrote:
> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>
> > On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
> > <[email protected]> wrote:
> >
> > [...]
> >
> >> 4. Indexing is no longer O(1), which violates the guarantees of the
> >> index operator.
> >
> > Indexing is still O(1).
> >
> >> 5. Slicing (other than a full slice) is no longer O(1), which violates
> >> the guarantees of the slicing operator.
> >
> > Slicing is still O(1).
> >
> > [...]
>
> It feels extremely weird that the indices refer to code units and not
> code points. If I write
>
>     auto str = mystring("hæ?");
>     writeln(str[1], " ", str[2]);
>
> I expect it to print "æ ?", not "æ æ" like it does now.

If I understand correctly how _charStart works in combination with
indexing and slicing, then there is something wrong in the type's
interface. After

    auto str = mystring("hæ?");

either one provides a code unit index and gets a code unit:

    writeln(str[1], " ", str[2]);   // "� �" (invalid UTF-8 sequences)

or one provides a code point index and gets a code point:

    writeln(str[1], " ", str[2]);   // "æ ?"

But for string manipulation, wouldn't it be better for your string type
to systematically wrap a dchar[] array, whatever the original encoding?
So that indexing, slicing, finding, counting, etc. are fast, I mean, with
decoding being done only once, at string creation time.

> On a side note: It seems to me that the only reason to have char, wchar,
> and dchar as separate types in the language is that arrays of said types
> are UTF-encoded strings. If a type such as the proposed one were to
> become the default string type in D, it might as well wrap an array of
> ubyte/ushort/uint, since direct user manipulation of the underlying array
> will generally only happen in the rare cases when one wants to deal
> directly with code units.

Yes, but then, see remark above.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
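
To make the dchar[] suggestion above concrete, here is a minimal sketch of what I have in mind. It is not the mystring type from this thread: the name DecodedString and its members are made up for illustration, and it assumes the input is valid UTF-8. Decoding happens once in the constructor; after that, indices and slice bounds are code point positions and both operations are O(1).

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical wrapper (not the mystring type discussed in this thread):
// decodes the input once, then stores one dchar per code point.
struct DecodedString
{
    private dstring data;

    this(string s)
    {
        data = s.to!dstring; // O(n) UTF-8 -> UTF-32 decoding, paid once
    }

    // Number of code points, not code units.
    @property size_t length()
    {
        return data.length;
    }

    // O(1): i is a code point position.
    dchar opIndex(size_t i)
    {
        return data[i];
    }

    // O(1): lo and hi are code point positions.
    dstring opSlice(size_t lo, size_t hi)
    {
        return data[lo .. hi];
    }
}

void main()
{
    string raw = "hæ?";       // UTF-8: 'h' = 1 byte, 'æ' = 2 bytes, '?' = 1 byte
    auto str = DecodedString(raw);

    assert(raw.length == 4);  // code units
    assert(str.length == 3);  // code points
    assert(raw[1] == 0xC3 && raw[2] == 0xA6); // the two halves of 'æ'
    assert(str[1] == 'æ' && str[2] == '?');
    assert(str[1 .. 3] == "æ?"d);

    writeln(str[1], " ", str[2]);
}
```

The price is four bytes per code point and an O(n) decoding pass at construction. Note also that a code point is still not necessarily what a user perceives as one character (combining marks), so this only addresses the code unit versus code point question.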
