On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <[email protected]> wrote:

On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <[email protected]> wrote:
> 1. At least until universal function syntax is in the language, you lose
> the ability to do str.func().

Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added; it is a struct after all, not a builtin.

I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable.

Why? It's a choice between adding them to the struct, and doing func(str) instead. You already have to do func(R) for arbitrary ranges, I don't see why strings should be specialized. For things that should be considered 'built-in' for strings, they can be included as members of the struct.

In other words, I think it's worth fixing strings now, even if it means we cannot call str.func() until the universal syntax is introduced.

I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in usability would just be too great.

But you can still call the function, it's just func(str). There is no loss of functionality. Usability isn't even really lost.
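A minimal sketch of the point above, assuming a hypothetical struct MyString (a stand-in, not the type actually proposed in this thread) and a hypothetical free function countSpaces. It only illustrates that a free function remains callable as func(str) on a struct:

```d
// Hypothetical stand-in for the proposed string struct; this is NOT the
// actual type from the thread.
struct MyString
{
    immutable(char)[] data; // underlying UTF-8 code units
}

// A free function that takes the struct and is called as func(str).
// The name countSpaces is made up for illustration.
size_t countSpaces(MyString s)
{
    size_t n = 0;
    foreach (char c; s.data) // ' ' is ASCII, so a code unit scan is enough
        if (c == ' ')
            ++n;
    return n;
}

unittest
{
    auto s = MyString("a b c");
    assert(countSpaces(s) == 2); // plain call syntax, no member function needed
}
```

The block can be checked with something like dmd -unittest -main -run on the file.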

> 2. Functions that would work just fine treating strings as arrays of
> code units (presumably because they don't care about what the actual
> data is) lose efficiency, because now a string isn't an array.

Which functions are those?  They can be allowed via wrappers.

It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.

Since std.algorithm works exclusively with ranges, those must be special cases for strings, because strings are bidirectional ranges of dchar according to Phobos. So those can continue to be special cases for the new string types.
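As a small illustration of how Phobos views the built-in string type, the following sketch checks the range traits and contrasts the array length with the range length. The sample text is arbitrary:

```d
import std.range.primitives;

// Phobos treats a string as a range of dchar even though it is stored
// as an array of UTF-8 code units.
static assert(is(ElementType!string == dchar));
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);

unittest
{
    string s = "h\u00E9llo";   // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);     // array view: number of code units
    assert(s.walkLength == 5); // range view: number of code points
    assert(s.front == 'h');    // front and back decode a whole code point
    assert(s.back == 'o');
}
```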

> 4. Indexing is no longer O(1), which violates the guarantees of the index
> operator.

Indexing is still O(1).

> 5. Slicing (other than a full slice) is no longer O(1), which violates
> the guarantees of the slicing operator.

Slicing is still O(1).

You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disconnect seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong, since that totally breaks with the rest of the API dealing in dchar/code points, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.

How many times do you use a hard-coded index into a string without knowing the encoding of the string (i.e. I know this string is ascii)? How many times do you iterate the characters of a string via an incrementing index?

It's not common to use the indexing operation with values that are not known or computed to be valid starts of code points. In fact, the language depends on that (otherwise you'd see sliced strings everywhere with invalid data).

So while it looks strange, it shouldn't be a common need.

That being said, I think Lars pointed out that the strangeness of returning the code point even if you point in the middle would be surprising in some cases, so I think the better solution is to throw an exception.
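One way to picture "index by code unit, return a code point, throw if the index lands mid-character" is the sketch below. It is not the struct's actual implementation: codePointAt is a hypothetical helper built on std.utf.decode, which reads at most four code units and throws UTFException when the index does not point at the start of a valid sequence:

```d
import std.utf : decode, UTFException;

// Hypothetical helper: `i` is a code unit index; the result is the code
// point that starts at that index.
dchar codePointAt(string s, size_t i)
{
    size_t idx = i;        // decode advances the index it is given
    return decode(s, idx); // throws UTFException on an invalid start
}

unittest
{
    import std.exception : assertThrown;

    string s = "h\u00E9llo"; // U+00E9 occupies code units 1 and 2
    assert(codePointAt(s, 0) == 'h');
    assert(codePointAt(s, 1) == '\u00E9');
    assertThrown!UTFException(codePointAt(s, 2)); // middle of U+00E9
    assert(codePointAt(s, 3) == 'l');
}
```

Because decode looks at a bounded number of code units, the lookup stays O(1) in the length of the string.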

> What you're doing here is forcing the view of a string as a range of
> dchar in all cases. Granted, that's what you want in most cases, but it
> can degrade efficiency, and the fact that some operations (in particular
> indexing and slicing) are not O(1) like they're supposed to be means
> that algorithms which rely on O(1) behavior from them could increase
> their cost by an order of magnitude. All the cases where treating a
> string as an actual array which are currently valid are left out to dry

You can still use char[] and wchar[].

Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.

Yeah, it's definitely needed. I'll add access to the data member in the next version.
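A rough sketch of what passing the same value to both kinds of function could look like once the underlying array is reachable. MyString, its data member, and the two helper functions are assumptions made for illustration; the real struct's member names may differ:

```d
import std.algorithm.searching : count;
import std.range.primitives : walkLength;
import std.string : representation;

// Hypothetical stand-in for the proposed struct; `data` is an assumed name.
struct MyString
{
    immutable(char)[] data;
}

// Code unit view: ',' is ASCII and never occurs inside a multi-byte
// UTF-8 sequence, so scanning the raw bytes gives the right answer.
size_t countCommas(MyString s)
{
    return s.data.representation.count(cast(ubyte) ',');
}

// Code point view: walkLength decodes the string as a range of dchar.
size_t countCodePoints(MyString s)
{
    return s.data.walkLength;
}

unittest
{
    auto s = MyString("\u00E9,b,c"); // 6 code units, 5 code points
    assert(countCommas(s) == 2);
    assert(countCodePoints(s) == 5);
    assert(s.data.length == 6);
}
```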

Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.

There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach.

or indexing.  This is a huge problem.

It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem.
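The foreach behavior described here can be seen in a short sketch. The function names are made up for illustration; the loops count iterations with the inferred element type and with an explicit dchar:

```d
// Iterations when the element type is left to inference.
size_t countDefault(string s)
{
    size_t n = 0;
    foreach (c; s)       // c is inferred as a char: one step per code unit
        ++n;
    return n;
}

// Iterations when dchar is requested explicitly.
size_t countDecoded(string s)
{
    size_t n = 0;
    foreach (dchar c; s) // asking for dchar makes foreach decode UTF-8
        ++n;
    return n;
}

unittest
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units
    assert(countDefault(s) == 6);
    assert(countDecoded(s) == 5);
}
```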

I agree that would be ideal, but it still doesn't solve the indexing problem.

There is an inherent but necessary disconnect between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well.

The thing is, the most common use of strings is as a string, not as an array of code-units. The common case is to print, slice, find, etc. on a *string*. When dealing with the string as a whole, either using an array or a specialized type works equally well.

The uncommon case is to extract individual characters from the string. In this case, the default needs to be the most common need in that area -- extracting a dchar, not a code-unit. Having the default index operation extract a code-unit is very incorrect.

-Steve
