On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:

I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is default. For better or worse, that ship has sailed.

Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way:

Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point.

So... how about this? We add whichever of these Unicode algorithms we're missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)

I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but after thinking about the problem for a while, I'd agree with a solution along the lines of what you have suggested.

I think Vladimir is definitely right that when you have algorithms dealing with natural languages, simply working on the basis of a code unit isn't enough. It's also true that you need to select a particular algorithm for dealing with strings of characters: there are many different algorithms for different languages, which behave differently, and perhaps several within a single language. I also think Andrei is right that we need to minimise code breakage, and that string decoding and encoding by default isn't the biggest of performance problems.

I think our best option is to add a function to std.array which returns a range over a string's raw character data, without decoding to code points.

myArray.someAlgorithm; // Today: std.array's front decodes each code point.
myArray.rawData.someAlgorithm; // New range which doesn't decode.
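As a sketch of what rawData might look like (the name and the wrapper are hypothetical; for UTF-8 strings it could be as simple as reinterpreting the array as ubyte[], whose range primitives step one code unit at a time):

```d
import std.range : walkLength;

// Hypothetical rawData: expose a string's code units without Phobos's
// auto-decoding. Casting to ubyte[] yields a range whose front/popFront
// advance one code unit at a time instead of decoding a dchar.
auto rawData(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "héllo";
    assert(s.walkLength == 5);      // decoded view: 5 code points
    assert(s.rawData.length == 6);  // raw view: 6 UTF-8 code units
}
```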

Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.

myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // Range of strings (or of character ranges), not dchars

Or we could even specialise the new algorithms so they detect arrays and wrap them for you, applying the transformation myArray -> myArray.rawData automatically.

myArray.byNaturalSymbol!SomeIndianEncodingHere;
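For the common case of user-perceived characters, std.uni's byGrapheme already does something like this in a language-independent way; a byNaturalSymbol taking an encoding or language parameter would be a tailored generalisation of the same idea. A minimal illustration of the existing grapheme-cluster range:

```d
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    // "e" followed by a combining acute accent: two code points,
    // but one user-perceived character (grapheme cluster).
    string s = "e\u0301";
    assert(s.walkLength == 2);             // code points
    assert(s.byGrapheme.walkLength == 1);  // grapheme clusters
}
```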

Honestly, I'd leave the details of such an algorithm to Vladimir rather than myself, because he's spent far more time looking into Unicode processing than I have. My knowledge of Unicode comes mostly from dealing with foreign-language customers and discovering the problems with the code-unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have generic algorithms in the way that we do.)

This new set of algorithms, parameterised over different encodings, could first be implemented in a third-party library, tested there, and eventually submitted to Phobos, probably in std.string.

There's my input, I'll duck before I'm beheaded.
