On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu
wrote:
Pretty much everything. Consider s and s1 string variables with
possibly different encodings (UTF8/UTF16).
* s.all!(c => c == 'ö') works only with autodecoding. It
returns always false without.
False. Many characters can be represented by different sequences
of code points. For instance, ê can be a single precomposed code
point, or an e followed by a combining circumflex. ö is one such
character.
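To make this concrete, here is a minimal sketch (the sample
strings are my own choice) showing the predicate failing on the
decomposed form even with autodecoding:

import std.algorithm : all;
import std.uni : normalize, NFC;

void main()
{
    string pre = "\u00F6";  // ö as one precomposed code point
    string dec = "o\u0308"; // ö as o followed by a combining diaeresis

    // Autodecoding compares code points, so only the first form matches:
    assert(pre.all!(c => c == 'ö'));
    assert(!dec.all!(c => c == 'ö'));

    // The two forms only agree after normalization:
    assert(normalize!NFC(dec).all!(c => c == 'ö'));
}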
* s.any!(c => c == 'ö') works only with autodecoding. It
returns always false without.
False. (While this is pretty much the same as the previous one,
one can come up with as many examples as desired by tweaking the
same one to produce endless variations.)
* s.balancedParens('〈', '〉') works only with autodecoding.
Not sure, so I'll say OK.
* s.canFind('ö') works only with autodecoding. It returns
always false without.
False.
* s.commonPrefix(s1) works only if they both use the same
encoding; otherwise it still compiles but silently produces an
incorrect result.
False.
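Same cause as above: even with identical encodings and
autodecoding, commonPrefix stops at the first code-point
mismatch, and a normalization difference is enough to trigger
one. A minimal sketch (strings are mine):

import std.algorithm : commonPrefix;
import std.uni : normalize, NFC;

void main()
{
    string a = "sch\u00F6n";  // "schön" with a precomposed ö
    string b = "scho\u0308n"; // "schön" with o + combining diaeresis

    // Same text, same encoding, yet the prefix stops before the ö:
    assert(a.commonPrefix(b) == "sch");

    // Normalizing both sides first recovers the full match:
    assert(normalize!NFC(a).commonPrefix(normalize!NFC(b)) == "sch\u00F6n");
}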
* s.count('ö') works only with autodecoding. It returns always
zero without.
False.
* s.countUntil(s1) is really odd - without autodecoding,
whether it works at all, and the result it returns, depends on
both encodings. With autodecoding it always works and returns a
number independent of the encodings.
False.
* s.endsWith('ö') works only with autodecoding. It returns
always false without.
False.
* s.endsWith(s1) works only with autodecoding. Otherwise it
compiles and runs but produces incorrect results if s and s1
have different encodings.
False.
* s.find('ö') works only with autodecoding. It never finds it
without.
False.
* s.findAdjacent is a very interesting one. It works with
autodecoding, but without it it just does odd things.
Not sure so I'll say OK, while I strongly suspect that, like for
other, this will only work if string are normalized.
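Here is the kind of odd thing I suspect, sketched with a string
of my own: at code-point level, a base letter followed by a
combining mark can masquerade as an adjacent duplicate.

import std.algorithm : findAdjacent;

void main()
{
    // "aä" with the ä decomposed: 'a', 'a', combining diaeresis.
    string s = "aa\u0308";

    // At code-point level this reports a doubled 'a', although the
    // text contains no two adjacent equal characters.
    assert(!s.findAdjacent.empty);
}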
* s.findAmong(s1) is also interesting. It works only with
autodecoding.
False.
* s.findSkip(s1) works only if s and s1 have the same encoding.
Otherwise it compiles and runs but produces incorrect results.
False.
* s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1)
work only if s and s1 have the same encoding. Otherwise they
compile and run but produce incorrect results.
False.
* s.minCount, s.maxCount are unlikely to be terribly useful but
with autodecoding it consistently returns the extremum numeric
code unit regardless of representation. Without, they just
return encoding-dependent and meaningless numbers.
Not sure, so I'll say OK.
* s.minPos, s.maxPos follow a similar semantics.
Not sure, so I'll say OK.
* s.skipOver(s1) only works with autodecoding. Otherwise it
compiles and runs but produces incorrect results if s and s1
have different encodings.
False.
* s.startsWith('ö') works only with autodecoding. Otherwise it
compiles and runs but produces incorrect results if s and s1
have different encodings.
False.
* s.startsWith(s1) works only with autodecoding. Otherwise it
compiles and runs but produces incorrect results if s and s1
have different encodings.
False.
* s.until!(c => c == 'ö') works only with autodecoding.
Otherwise, it will span the entire range.
False.
===
The intent of autodecoding was to make std.algorithm work
meaningfully with strings. As it's easy to see I just went
through std.algorithm.searching alphabetically and found issues
literally with every primitive in there. It's an easy exercise
to go forth with the others.
Andrei
I mean, what a trainwreck. Your examples say it all, don't they?
Almost none of them would work without normalizing the string
first. And that is the point you've been refusing to hear so far:
autodecoding doesn't pay for itself, as it is unable to do what
it is supposed to do in the general case.
Really, there is not much you can do with anything
Unicode-related without first going through normalization. If you
want anything more than searching for substrings or the like,
you'll also need collation, which is locale-dependent (for
sorting, for instance). Supporting Unicode, IMO, would mean
providing facilities to normalize (preferably lazily, as a
range), to manage collations, and so on. Decoding to code points
just doesn't cut it.
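As a rough illustration with today's Phobos (sameText is a
hypothetical helper name of mine): any honest comparison has to
eagerly normalize both sides first, precisely because no lazy
normalization range exists.

import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings as text rather than as code
// units or code points. It allocates two normalized copies, since
// Phobos offers no lazy normalization range.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    // Precomposed vs. decomposed spellings of "schön":
    assert(sameText("sch\u00F6n", "scho\u0308n"));
    assert("sch\u00F6n" != "scho\u0308n"); // a plain comparison disagrees
}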
As a result, any algorithm that needs to support strings must
either fight against the language because it doesn't need
decoding, use decoding and accept being incorrect on unnormalized
input, or do the correct thing by itself (which is also going to
require working against the language).
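To put numbers on those three levels with today's Phobos (the
example string is mine): byCodeUnit is the fight-the-language
option, the default autodecodes to code points, and byGrapheme is
the one that actually matches what the user sees.

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "scho\u0308n"; // "schön" in decomposed form

    assert(s.byCodeUnit.walkLength == 7); // code units: what memory holds
    assert(s.walkLength == 6);            // code points: what autodecoding yields
    assert(s.byGrapheme.walkLength == 5); // graphemes: what the user sees
}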