On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
> Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).

> * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.


False. Many characters can be represented by different sequences of code points. For instance, ê can be a single code point, or an 'e' followed by a combining circumflex. ö is one such character.
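
To make the problem concrete, here is a minimal, illustrative sketch (plain Phobos, hypothetical variable names) showing the claim failing on a decomposed but canonically equivalent string:

import std.algorithm.searching : all, canFind;
import std.uni : normalize, NFC;

void main()
{
    string composed = "ö";         // single code point, U+00F6
    string decomposed = "o\u0308"; // 'o' followed by U+0308 COMBINING DIAERESIS

    // With autodecoding, the composed form behaves as advertised...
    assert(composed.all!(c => c == 'ö'));

    // ...but the canonically equivalent decomposed form does not:
    // no single code point equals 'ö'.
    assert(!decomposed.all!(c => c == 'ö'));
    assert(!decomposed.canFind('ö'));

    // Normalizing first restores the expected answer.
    assert(decomposed.normalize!NFC.all!(c => c == 'ö'));
}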

> * s.any!(c => c == 'ö') works only with autodecoding. It returns always false without.


False. (While this is pretty much the same as the first example, one can come up with as many examples as desired by tweaking the same one to produce endless variations.)

> * s.balancedParens('〈', '〉') works only with autodecoding.


Not sure, so I'll say OK.

> * s.canFind('ö') works only with autodecoding. It returns always false without.


False.

> * s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result.


False.

> * s.count('ö') works only with autodecoding. It returns always zero without.


False.

> * s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings.


False.

> * s.endsWith('ö') works only with autodecoding. It returns always false without.


False.

> * s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.find('ö') works only with autodecoding. It never finds it without.


False.

> * s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things.


Not sure, so I'll say OK, though I strongly suspect that, as with the others, this will only work if the strings are normalized.
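
For what it's worth, a quick illustrative sketch (plain Phobos) backing that suspicion up: with autodecoding, findAdjacent misses two adjacent identical graphemes when they are stored decomposed, and only finds them after normalization.

import std.algorithm.searching : findAdjacent;
import std.uni : normalize, NFC;

void main()
{
    // Two "ö" graphemes in decomposed form: 'o' + U+0308, twice.
    string s = "o\u0308o\u0308";

    // Autodecoding yields the code points [o, U+0308, o, U+0308];
    // no two adjacent ones are equal, so nothing is found.
    assert(s.findAdjacent.empty);

    // After NFC normalization the string is U+00F6 U+00F6 and the
    // adjacent pair is found.
    assert(!s.normalize!NFC.findAdjacent.empty);
}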

> * s.findAmong(s1) is also interesting. It works only with autodecoding.


False.

> * s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results.


False.

> * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. Otherwise they compile and run but produce incorrect results.


False.

> * s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers.


Not sure, so I'll say OK.

> * s.minPos, s.maxPos follow a similar semantics.


Not sure, so I'll say OK.

> * s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range.


False.

> ===
>
> The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others.
>
> Andrei

I mean, what a train wreck. Your examples say it all, don't they? Almost none of them would work without normalizing the strings first, and that is the point you've been refusing to hear so far. Autodecoding doesn't pay for itself, as it is unable to do what it is supposed to do in the general case.

Really, there is not much you can do with anything Unicode-related without first going through normalization. If you want anything more than substring search or the like, you'll also need collation, which is locale-dependent (for sorting, for instance).
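
To illustrate the sorting point (hypothetical words, plain Phobos): code point order is not linguistic order, and there is no collation facility in Phobos to express the latter.

import std.algorithm.sorting : sort;

void main()
{
    string[] words = ["zebra", "öl"];
    sort(words);

    // Code unit / code point order puts 'z' (U+007A) before
    // 'ö' (U+00F6), so "zebra" sorts first...
    assert(words == ["zebra", "öl"]);

    // ...whereas a German dictionary collation would put "öl" first.
}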

Supporting Unicode, IMO, would mean providing facilities to normalize (preferably lazily, as a range), to manage collations, and so on. Decoding to code points just doesn't cut it.

As a result, any algorithm that needs to support strings must either fight against the language because it doesn't need decoding, use decoding and accept being incorrect on non-normalized input, or do the correct thing by itself (which is also going to require working against the language).
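
(For completeness, the usual way to fight the language in the first case is std.utf.byCodeUnit, which wraps a string so the range primitives stop decoding. A tiny sketch:)

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "hëllo";

    // The wrapped range yields raw UTF-8 code units (chars),
    // not decoded dchars, so no decoding cost is paid.
    auto units = s.byCodeUnit;
    assert(units.canFind('l'));

    // 0xC3 is the lead byte of the two-byte encoding of 'ë'.
    assert(units.canFind(cast(char) 0xC3));
}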
