On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:

I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is default. For better or worse, that ship has sailed.

Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way:

Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point.

So... how about this? We add whichever of these Unicode algorithms we're missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)

I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but after thinking about the problem for a while, I'd agree with a solution along the lines of what you have suggested.

I think Vladimir is definitely right that when you have algorithms dealing with natural languages, simply working on the basis of a code unit isn't enough. It's also true that you need to select a particular algorithm for dealing with strings of characters: there are many different algorithms for different languages, which behave differently, and perhaps several within a single language. I also think Andrei is right that we need to minimise code breakage, and that string decoding and encoding by default isn't the biggest of performance problems.

I think our best option is to add a function to std.array which returns a range over a string's raw character data, without decoding to code points.

myArray.someAlgorithm; // Today: std.array's front decodes each code point.
myArray.rawData.someAlgorithm; // New range which doesn't decode.
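As a sketch of what rawData might look like (the name and the wrapper are hypothetical; for UTF-8 strings it could be as simple as reinterpreting the array as ubyte[], whose range primitives step one code unit at a time):

```d
import std.range : walkLength;

// Hypothetical rawData: expose a string's code units without Phobos's
// auto-decoding. Casting to ubyte[] yields a range whose front/popFront
// advance one code unit at a time instead of decoding a dchar.
auto rawData(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "héllo";
    assert(s.walkLength == 5);      // decoded view: 5 code points
    assert(s.rawData.length == 6);  // raw view: 6 UTF-8 code units
}
```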

Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.

myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // Range of strings (or of character ranges), not dchars

Or we could even specialise the new algorithms so they detect arrays and wrap them for you, applying the transformation myArray -> myArray.rawData automatically.

myArray.byNaturalSymbol!SomeIndianEncodingHere;
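For the common case of user-perceived characters, std.uni's byGrapheme already does something like this in a language-independent way; a byNaturalSymbol taking an encoding or language parameter would be a tailored generalisation of the same idea. A minimal illustration of the existing grapheme-cluster range:

```d
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    // "e" followed by a combining acute accent: two code points,
    // but one user-perceived character (grapheme cluster).
    string s = "e\u0301";
    assert(s.walkLength == 2);             // code points
    assert(s.byGrapheme.walkLength == 1);  // grapheme clusters
}
```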

Honestly, I'd leave the details of such an algorithm to Vladimir rather than myself, because he's spent far more time looking into Unicode processing than I have. My knowledge of Unicode comes mostly from dealing with foreign-language customers and discovering the problems with the code-unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have generic algorithms in the way that we do.)

This new set of algorithms, parameterised over different encodings, could first be implemented in a third-party library, tested there, and eventually submitted to Phobos, probably in std.string.

There's my input, I'll duck before I'm beheaded.
