Re: [review] new string type

Steven Schveighoffer Wed, 01 Dec 2010 14:10:40 -0800

On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <[email protected]> wrote:

On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <[email protected]>
wrote:
> 1. At least until universal function syntax is in the language, you lose
> the ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem
with the language. Second, any string-specific functions can be added, it
is a struct after all, not a builtin.

I really don't think that we want to have to be adding string functions to
a string struct. Having them external is far better, particularly when
there are so many of them which aren't directly related to strings but use
them heavily. Even if all of std.string got tacked onto a struct string
type, that would leave out the rest of Phobos and any user code. I really
think that universal function syntax is a necessity for a struct solution
to be acceptable.

Why? It's a choice between adding them to the struct, and doing func(str)
instead. You already have to do func(R) for arbitrary ranges, I don't see
why strings should be specialized. For things that should be considered
'built-in' for strings, they can be included as members of the struct.

In other words, I think it's worth fixing strings now, even if it means we
cannot call str.func() until the universal syntax is introduced.

I had considered a solution similar to this one a few months back, and the
lack of universal function syntax is one of the reasons why I decided that
it wasn't really an improvement. Honestly, without uniform function call
syntax, I would consider a struct solution to be DOA. The loss in usability
would just be too great.

But you can still call the function, it's just func(str). There is no loss
of functionality. Usability isn't even really lost.

> 2. Functions that would work just fine treating strings as arrays of
> code units
> (presumably because they don't care about what the actual data is) lose
> efficiency, because now a string isn't an array.

Which functions are those?  They can be allowed via wrappers.

It is my understanding that there are various functions in std.algorithm
which are able to treat strings as arrays and therefore process them more
efficiently. I haven't looked into which ones are in that camp. I would
think that find() might be, but I'd have to look.

Since std.algorithm works exclusively with ranges, those must be special
cases for strings because strings are bi-directional ranges of dchar
according to Phobos. So those can continue to be special cases for the new
string types.
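
A minimal sketch of the range view described above, assuming a Phobos in which std.range exposes these traits. The helper name `firstPoint` is made up for illustration and does not come from this thread:

```d
import std.range : ElementType, front, isBidirectionalRange, isRandomAccessRange;

// The range traits see a narrow string as a range of dchar:
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string); // no O(1) access by code point
static assert(is(ElementType!string == dchar));

// Hypothetical helper: the first code point, decoded by the range primitive.
dchar firstPoint(string s)
{
    return s.front;
}
```

Any algorithm that wants to work on the raw code units therefore has to special-case narrow strings rather than rely on the generic range interface.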

> 4. Indexing is no longer O(1), which violates the guarantees of the index
> operator.

Indexing is still O(1).

> 5. Slicing (other than a full slice) is no longer O(1), which violates
> the
> guarantees of the slicing operator.

Slicing is still O(1).

You're right. I misread what _charStart() did. However, if I understand it
correctly now, you give it a code unit index and yet get a code point back.
That worries me. It means that there is no relation between the length of
the string and the indices that you'd use to index it. It _is_ related to
the number of code units, which you can get with codeUnits(), but that
disjoint seems like it could cause problems. It might be the correct
solution, but I think that it merits some examination. Returning a code
unit would be wrong since that totally breaks with the rest of the API
dealing in dchar/code points, and it would be wrong to index by code point,
since then indexing and slicing are no longer O(1). So, it's probably a
choice between indexing by one thing and returning another or not having
indexing and slicing, which would definitely not be good. So, maybe your
solution is the best one in this respect, but it worries me. The exact
ramifications of that need to be looked into.

How many times do you use a hard-coded index into a string without knowing
the encoding of the string (i.e. I know this string is ASCII)? How many
times do you iterate the characters of a string via an incrementing index?

It's not common to use the indexing operation with values that are not
known or computed to be valid starts to code-points. In fact, the language
depends on that (otherwise you'd see sliced strings everywhere with invalid
data).


So while it looks strange, it shouldn't be a common need.

That being said, I think Lars pointed out that the strangeness of returning
the code point even if you point in the middle would be surprising in some
cases, so I think the better solution is to throw an exception.
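
A minimal sketch of that alternative: index by code unit offset, return the code point that starts there, and throw when the offset lands inside a multi-byte sequence. The type name `Utf8String` and its `data` field are hypothetical and are not the struct under review:

```d
import std.utf : decode;

// Hypothetical wrapper: stores UTF-8 code units and indexes by code unit
// offset, returning the code point that begins at that offset.
struct Utf8String
{
    string data; // underlying code units, directly accessible

    // Constant time: a UTF-8 sequence is at most 4 code units long.
    dchar opIndex(size_t i)
    {
        // Continuation bytes have the bit pattern 10xxxxxx.
        if ((data[i] & 0xC0) == 0x80)
            throw new Exception("index points into the middle of a code point");
        size_t pos = i;           // decode advances its index argument
        return decode(data, pos);
    }
}
```

With `auto s = Utf8String("h\u00E9llo");`, `s[1]` yields the two-unit character starting at offset 1, while `s[2]` throws because offset 2 is that character's continuation byte.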

> What you're doing here is forcing the view of a string as a range of
> dchar in
> all cases. Granted, that's what you want in most cases, but it can
> degrade
> efficiency, and the fact that some operations (in particular indexing
> and slicing)
> are not O(1) like they're supposed to be means that algorithms which
> rely on
> O(1) behavior from them could increase their cost by an order of
> magnitude. All
> the cases where treating a string as an actual array which are currently
> valid
> are left out to dry

You can still use char[] and wchar[].

Except that what if you need to do both with the same type? Right now, you
could have a function which treats a string as a range of dchar while
another one which can get away with treating it as a range of code units
can treat it as an array. You can pass the same string to both, and it
works. That should still work if we go for a struct solution.
Special-casing on strings and specifically using the internal array instead
of the struct for them could fix the problem, but it still needs to be
possible.

Yeah, it's definitely needed. I'll add access to the data member in the
next version.

Hopefully you can see that I'm not eliminating the functionality you are
looking for, just making it not the default.

There is definitely some value in making strings treated as ranges of dchar
by default. But for the most part, that's the case already thanks to how
std.array is defined. The only place where that runs into trouble is if you
use foreach.


or indexing.  This is a huge problem.

It still treats them as arrays of code units unless you tell it to iterate
over dchar. Either making foreach over character arrays iterate over dchar
by default or making it a warning or error to use foreach with a char or
wchar array of any kind without specifying the type to iterate over would
fix that problem.

I agree that would be ideal, but it still doesn't solve the indexing
problem.
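
To illustrate the foreach behaviour under discussion, a small sketch (the function names are made up for this example):

```d
// Counts iterations when the element type is inferred: for a string this
// is char, so the loop visits UTF-8 code units.
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Counts iterations when dchar is requested explicitly: the loop decodes
// the string and visits code points.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}
```

For "d\u00E9j\u00E0" (two of its four characters take two code units each), countUnits returns 6 and countPoints returns 4.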

There is an inherent but necessary disjoint between having strings be
arrays of code units and ranges of dchar. Sometimes they need to be treated
as one, sometimes as the other. Ideally, the default - whichever it is -
would be the one which leads to fewer bugs. But they both need to be there.
A struct solution is essentially an attempt to make strings treated as
ranges of dchar in more situations by default than is currently the case.
As such, it could be better than what we have now, but I'm honestly not
convinced that it is. Aside from the foreach problem (which could be fixed
with an appropriate warning or error - preferably error), what we have
works quite well.

The thing is, the most common use of strings is as a string, not as an
array of code-units. The common case is to print, slice, find, etc. on a
*string*. When dealing with the string as a whole, either using an array or
a specialized type works equally well.

The uncommon case is to extract individual characters from the string. In
this case, the default needs to be the most common need in that area --
extracting a dchar, not a code-unit. Having the default index operation
extract a code-unit is very incorrect.


-Steve
