On 3/7/14, 12:26 PM, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
s.canFind('é') currently works as expected. Proposed: fails silently.

The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should
not need to decode every single character in s just to see if it happens
to contain é. Rather, canFind, et al should convert the dchar literal
'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
instead. Decoding every character in s, while correct, is also
needlessly inefficient.

That's an optimization that fits the current design and goes into the library transparently, i.e. the good stuff.
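A minimal sketch of that optimization, using only existing Phobos primitives (the helper name canFindFast is hypothetical, not Phobos code): encode the dchar needle into UTF-8 code units once, then do a plain substring search over the raw bytes, with no per-character decoding of s.

```d
import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Hypothetical helper illustrating the idea; not part of Phobos.
bool canFindFast(string s, dchar c)
{
    char[4] buf;                     // a UTF-8 sequence is at most 4 code units
    immutable len = encode(buf, c);  // std.utf.encode returns the length used
    // representation views the data as ubyte[], so canFind does a raw
    // substring search instead of decoding every code point of s.
    return s.representation.canFind(buf[0 .. len].representation);
}
```

Because UTF-8 is self-synchronizing, a code-unit-level match of a well-formed sequence can only occur at a real character boundary, so this gives the same answer as the decoding version.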

5.

s.count() currently works as expected. Proposed: fails silently.

Wrong. The current behaviour of s.count() does not work as expected; it
only gives the illusion that it does.

Depends on what one expects :o).

Its return value is misleading when
combining diacritics and other such Unicode "niceness" are involved.
Arguably, such things should be prohibited altogether, and more
semantically transparent algorithms used, namely s.countCodePoints,
s.countGraphemes, etc.

I think s.byGrapheme.count is the right way instead of specializing a bunch of algorithms to work with graphemes.
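A quick illustration of the three counts in play, using std.uni.byGrapheme as suggested ('e' plus a combining acute accent renders as one character):

```d
import std.algorithm.searching : count;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    assert(s.length == 3);           // 3 UTF-8 code units
    assert(s.count == 2);            // 2 code points -- what count gives today
    assert(s.byGrapheme.count == 1); // 1 grapheme -- what a reader perceives
}
```

The same byGrapheme composition works for the other algorithms too, which is the argument against specializing each of them.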

s.endsWith('é') currently works as expected. Proposed: fails silently.

Arguable, because it imposes a performance hit by needless decoding.
Ideally, you should have 3 overloads:

        bool endsWith(string s, char asciiChar);
        bool endsWith(string s, wchar wideChar);
        bool endsWith(string s, dchar codepoint);

Nice idea. Fits current design. Then interesting complications arise with things like bool endsWith(string, wstring) etc.
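The dchar overload above could avoid decoding s the same way: encode the needle once and compare raw code units at the tail. A sketch under a hypothetical name (to avoid clashing with std.algorithm's endsWith):

```d
import std.algorithm.searching : endsWith;
import std.string : representation;
import std.utf : encode;

// Hypothetical helper; sketches the proposed dchar overload.
bool endsWithChar(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);  // UTF-8 code units of the needle
    // Compare raw bytes at the end of s -- no decoding of s at all.
    return s.representation.endsWith(buf[0 .. len].representation);
}
```

The char and wchar overloads would be the analogous one- and two-code-unit fast paths for string and wstring respectively.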

[...]
I designed the range behavior of strings after much thinking and
consideration back in the day when I designed std.algorithm. It was
painfully obvious (but it seems to have been forgotten now that it's
working so well) that approaching strings as plain arrays of char would
break almost every single algorithm, leaving us essentially in the
pre-UTF C++aveman era.

I agree, but it is also painfully obvious that the current
implementation is lackluster in terms of performance.

It's not painfully obvious to me at all. What is obvious to me is that people are happy campers with the way D's strings work, including UTF support and performance. I don't remember people bringing this up in forums or here at Facebook: "yeah, just look at the crappy way they handle strings..." Silent approval is easy to forget about.

Walter has been working on an application in which anything slower than 2x baseline would have been a failure. In that app (which I know very well) the right option from day 1 would have been ubyte[], which he discovered the hard way. His incomplete understanding of how D strings work is the single largest problem there, and indicates an issue with the documentation.

He discovered that, was surprised, and overreacted. No need to amplify that into mass hysteria. There are improvements that can be made, in the form of additions, not breaking changes that would inflict massive breakage on the community. This is the way in which this discussion can have a positive outcome. (I've shared in fact a few ideas with Walter.)

Clearly one might argue that their app has no business dealing with
diacriticals or Asian characters. But that's the typical provincial
view that marred many languages' approach to UTF and
internationalization. If you know your string is ASCII, the remedy
is simple - don't use char[] and friends. From day 1, the type
"char" was meant to mean "code unit of UTF characters".

Yes, but currently Phobos support for non-UTF strings is rather poor,
and requires many explicit casts to/from ubyte[].

Non-UTF strings are currently modeled as ubyte[], so I don't see what you'd be casting to and fro. You have absolutely no business representing anything non-UTF with char and char[] etc.
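A small sketch of that modeling: non-UTF bytes live in ubyte[] and get byte-level operations directly, while a known-valid UTF string can be viewed as raw bytes through std.string.representation when needed.

```d
import std.string : representation;

void main()
{
    // Latin-1 (non-UTF) data belongs in ubyte[], not char[]:
    immutable(ubyte)[] latin1 = [0x63, 0x61, 0x66, 0xE9]; // "café" in Latin-1

    assert(latin1[3] == 0xE9); // plain byte access, no decoding, no casts

    // A well-formed UTF-8 string viewed as raw code units:
    string s = "café";
    assert(s.representation.length == 5); // 'é' takes 2 UTF-8 code units
}
```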

So please ponder the above before going to do surgery on the patient
that's going to kill him.
[...]

Yeah I was surprised Walter was actually seriously going to pursue this.
It's a change of a far vaster magnitude than many of the other DIPs and
other proposals that have been rejected because they were deemed to
cause too much breakage of existing code.

Compared with what's going on now with D at Facebook, this agitation is but a little side show. We have way bigger fish to fry.


Andrei
