On 3/7/14, 12:26 PM, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
s.canFind('é') currently works as expected. Proposed: fails silently.

The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should
not need to decode every single character in s just to see if it happens
to contain é. Rather, canFind, et al should convert the dchar literal
'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
instead. Decoding every character in s, while correct, is also
needlessly inefficient.

That's an optimization that fits the current design and goes into the library transparently, i.e. the good stuff.
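A minimal sketch of that optimization, using only existing Phobos primitives (the helper name canFindFast is hypothetical, not Phobos code): encode the dchar needle into UTF-8 code units once, then do a plain substring search over the raw bytes, with no per-character decoding of s.

```d
import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Hypothetical helper illustrating the idea; not part of Phobos.
bool canFindFast(string s, dchar c)
{
    char[4] buf;                     // a UTF-8 sequence is at most 4 code units
    immutable len = encode(buf, c);  // std.utf.encode returns the length used
    // representation views the data as ubyte[], so canFind does a raw
    // substring search instead of decoding every code point of s.
    return s.representation.canFind(buf[0 .. len].representation);
}
```

Because UTF-8 is self-synchronizing, a code-unit-level match of a well-formed sequence can only occur at a real character boundary, so this gives the same answer as the decoding version.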

5.

s.count() currently works as expected. Proposed: fails silently.

Wrong. The current behaviour of s.count() does not work as expected; it
only gives the illusion that it does.

Depends on what one expects :o).

Its return value is misleading when
combining diacritics and other such Unicode "niceness" are involved.
Arguably, such things should be prohibited altogether, and more
semantically transparent algorithms used, namely s.countCodePoints,
s.countGraphemes, etc.

I think s.byGrapheme.count is the right way instead of specializing a bunch of algorithms to work with graphemes.
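A quick illustration of the three counts in play, using std.uni.byGrapheme as suggested ('e' plus a combining acute accent renders as one character):

```d
import std.algorithm.searching : count;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    assert(s.length == 3);           // 3 UTF-8 code units
    assert(s.count == 2);            // 2 code points -- what count gives today
    assert(s.byGrapheme.count == 1); // 1 grapheme -- what a reader perceives
}
```

The same byGrapheme composition works for the other algorithms too, which is the argument against specializing each of them.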

s.endsWith('é') currently works as expected. Proposed: fails silently.

Arguable, because it imposes a performance hit by needless decoding.
Ideally, you should have 3 overloads:

        bool endsWith(string s, char asciiChar);
        bool endsWith(string s, wchar wideChar);
        bool endsWith(string s, dchar codepoint);

Nice idea. Fits current design. Then interesting complications arise with things like bool endsWith(string, wstring) etc.
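The dchar overload above could avoid decoding s the same way: encode the needle once and compare raw code units at the tail. A sketch under a hypothetical name (to avoid clashing with std.algorithm's endsWith):

```d
import std.algorithm.searching : endsWith;
import std.string : representation;
import std.utf : encode;

// Hypothetical helper; sketches the proposed dchar overload.
bool endsWithChar(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);  // UTF-8 code units of the needle
    // Compare raw bytes at the end of s -- no decoding of s at all.
    return s.representation.endsWith(buf[0 .. len].representation);
}
```

The char and wchar overloads would be the analogous one- and two-code-unit fast paths for string and wstring respectively.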

[...]
I designed the range behavior of strings after much thinking and
consideration back in the day when I designed std.algorithm. It was
painfully obvious (but it seems to have been forgotten now that it's
working so well) that approaching strings as plain arrays of char would
break almost every single algorithm, leaving us essentially in the
pre-UTF C++aveman era.

I agree, but it is also painfully obvious that the current
implementation is lackluster in terms of performance.

It's not painfully obvious to me at all. What is obvious to me is that people are happy campers with the way D's strings work, including UTF support and performance. I don't remember people bringing this up in forums or here at Facebook: "yeah, just look at the crappy way they handle strings..." Silent approval is easy to forget about.

Walter has been working on an application in which anything slower than 2x baseline would have been a failure. In that app (which I know very well) the right option from day 1 would have been ubyte[], which he discovered the hard way. His incomplete understanding of how D strings work is the single largest problem there, and indicates an issue with the documentation.

He discovered that, was surprised, and overreacted. No need to amplify that into mass hysteria. There are improvements that can be made, in the form of additions, not breaking changes that would inflict massive breakage on the community. This is the way in which this discussion can have a positive outcome. (I've shared in fact a few ideas with Walter.)

Clearly one might argue that their app has no business dealing with
diacriticals or Asian characters. But that's the typical provincial
view that marred many languages' approach to UTF and
internationalization. If you know your string is ASCII, the remedy
is simple - don't use char[] and friends. From day 1, the type
"char" was meant to mean "code unit of UTF characters".

Yes, but currently Phobos support for non-UTF strings is rather poor,
and requires many explicit casts to/from ubyte[].

Non-UTF strings are currently modeled as ubyte[], so I don't see what you'd be casting to and fro. You have absolutely no business representing anything non-UTF with char and char[] etc.
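A small sketch of that modeling: non-UTF bytes live in ubyte[] and get byte-level operations directly, while a known-valid UTF string can be viewed as raw bytes through std.string.representation when needed.

```d
import std.string : representation;

void main()
{
    // Latin-1 (non-UTF) data belongs in ubyte[], not char[]:
    immutable(ubyte)[] latin1 = [0x63, 0x61, 0x66, 0xE9]; // "café" in Latin-1

    assert(latin1[3] == 0xE9); // plain byte access, no decoding, no casts

    // A well-formed UTF-8 string viewed as raw code units:
    string s = "café";
    assert(s.representation.length == 5); // 'é' takes 2 UTF-8 code units
}
```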

So please ponder the above before going to do surgery on the patient
that's going to kill him.
[...]

Yeah I was surprised Walter was actually seriously going to pursue this.
It's a change of a far vaster magnitude than many of the other DIPs and
other proposals that have been rejected because they were deemed to
cause too much breakage of existing code.

Compared with what's going on now with D at Facebook, this agitation is but a little side show. We have way bigger fish to fry.


Andrei
