On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <[email protected]>
wrote:
On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
<[email protected]>
wrote:
> 1. At least until universal function syntax is in the language, you
use
> the
> ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a
problem
with the language. Second, any string-specific functions can be added,
it
is a struct after all, not a builtin.
I really don't think that we want to have to be adding string functions
to a
string struct. Having them external is far better, particularly when
there are
so many of them which aren't directly related to strings but use them
heavily.
Even if all of std.string got tacked onto a struct string type, that
would leave
out the rest of Phobos and any user code. I really think that universal
function
syntax is a necessity for a struct solution to be acceptable.
Why? It's a choice between adding them to the struct, and doing func(str)
instead. You already have to do func(R) for arbitrary ranges, I don't see
why strings should be specialized. For things that should be considered
'built-in' for strings, they can be included as members of the struct.
In other words, I think it's worth fixing strings now, even if it means we
cannot call str.func() until the universal syntax is introduced.
I had considered a solution similar to this one a few months back, and
the lack
of universal function syntax is one of the reasons why I decided that it
wasn't
really an improvement. Honestly, without uniform function call syntax, I
would
consider a struct solution to be DOA. The loss in useability would just
be too
great.
But you can still call the function, it's just func(str). There is no
loss of functionality. Usability isn't even really lost.
> 2. Functions that would work just fine treating strings as arrays of
> code units
> (presumably because they don't care about what the actual data is)
lose
> efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
It is my understanding that there are various functions in std.algorithm
which
are able to treat strings as arrays and therefore process them more
efficiently. I
haven't looked into which ones are in that camp. I would think that
find() might
be, but I'd have to look.
Since std.algorithm works exclusively with ranges, those must be special
cases for strings because strings are bi-directional ranges of dchar
according to phobos. So those can continue to be special cases for the
new string types.
> 4. Indexing is no longer O(1), which violates the guarantees of the
index
> operator.
Indexing is still O(1).
> 5. Slicing (other than a full slice) is no longer O(1), which violates
> the
> guarantees of the slicing operator.
Slicing is still O(1).
You're right. I misread what _charStart() did. However, if I understand
it
correctly now, you give it a code unit index and yet get a code point
back. That
worries me. It means that there is no relation between the length of the
string
and the indices that you'd use to index it. It _is_ related to the
number of
code units, which you can get with codeUnits(), but that disjoint seems
like it
could cause problems. It might be the correct solution, but I think that
it
merits some examination. Returning a code unit would be wrong since that
totally
breaks with the rest of the API dealing in dchar/code units, and it
would be
wrong to index by code point, since then indexing and slicing are no
longer
O(1). So, it's probably a choice between indexing by one thing and
returning
another or not having indexing and slicing, which would definitely not
be good.
So, maybe your solution is the best one in this respect, but it worries
me. The
exact ramifications of that need to be looked into.
How many times do you use a hard-coded index into a string without knowing
the encoding of the string (i.e. I know this string is ascii)? How many
times do you iterate the characters of a string via an incrementing index?
It's not common to use the indexing operation with values that are not
known or computed to be valid starts to code-points. In fact, the
language depends on that (otherwise you'd see sliced strings everywhere
with invalid data).
So while it looks strange, it shouldn't be a common need.
That being said, I think Lars pointed out that the strangeness of
returning the code point even if you point in the middle would be
surprising in some cases, so I think the better solution is to throw an
exception.
> What you're doing here is forcing the view of a string as a range of
> dchar in
> all cases. Granted, that's what you want in most cases, but it can
> degrade
> efficiency, and the fact that some operations (in particular indexing
> and slicing)
> are not O(1) like they're supposed to be means that algorithms which
> rely on
> O(1) behavior from them could increase their cost by an order of
> magnitude. All
> the cases where treating a string as an actual array which are
currently
> valid
> are left out to dry
You can still use char[] and wchar[].
Except that what if you need to do both with the same type? Right now,
you could
have a function which treats a string as a range of dchar while another
one
which can get away with treating it as a range of code units can treat
it as an
array. You can pass the same string to both, and it works. That should
still
work if we go for a struct solution. Special-casing on strings and
specifically
using the internal array instead of the struct for them could fix the
problem,
but it still needs to be possible.
Yeah, it's definitely needed. I'll add access to the data member in the
next version.
Hopefully you can see that I'm not eliminating the functionality you are
looking for, just making it not the default.
There is definitely some value in making strings treated as ranges of
dchar by
default. But for the most part, that's the case already thanks to how
std.array
is defined. The only place where that runs into trouble is if you use
foreach.
or indexing. This is a huge problem.
It
still treats them as arrays of code units unless you tell it to iterate
over
dchar. Either making foreach over character arrays iterate over dchar by
default
or making it a warning or error to use foreach with a char or wchar
array of any
kind without specifying the type to iterate over would fix that problem.
I agree that would be ideal, but it still doesn't solve the indexing
problem.
There is an inherent but necessary disjoint between having strings be
arrays of
code units and ranges of dchar. Sometimes they need to be treated as one,
sometimes as the other. Ideally, the default - whichever it is - would
be the
one which leads to fewer bugs. But they both need to be there. A struct
solution
is essentially an attempt to make strings treated as ranges of dchar in
more
situations by default than is currenly the case. As such, it could be
better
than what we have now, but I'm honestly not convinced that it is. Aside
from the
foreach problem (which could be fixed with an appropriate warning or
error -
preferrably error), what we have works quite well.
The thing is, the most common use of strings is as a string, not as an
array of code-units. The common case is to print, slice, find, etc. on a
*string*. When dealing with the string as a whole, either using an array
or a specialized type works equally well.
The uncommon case is to extract individual characters from the string. In
this case, the default needs to be the most common need in that area --
extracting a dchar, not a code-unit. Having the default index operation
extract a code-unit is very incorrect.
-Steve