On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
> Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).

> * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.


False. Many characters can be represented by different sequences of code points. For instance, ê can be a single code point, or an 'e' followed by a combining circumflex. ö is one such character.
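
To make the problem concrete, here is a minimal, illustrative sketch (plain Phobos, hypothetical variable names) showing the claim failing on a decomposed but canonically equivalent string:

import std.algorithm.searching : all, canFind;
import std.uni : normalize, NFC;

void main()
{
    string composed = "ö";         // single code point, U+00F6
    string decomposed = "o\u0308"; // 'o' followed by U+0308 COMBINING DIAERESIS

    // With autodecoding, the composed form behaves as advertised...
    assert(composed.all!(c => c == 'ö'));

    // ...but the canonically equivalent decomposed form does not:
    // no single code point equals 'ö'.
    assert(!decomposed.all!(c => c == 'ö'));
    assert(!decomposed.canFind('ö'));

    // Normalizing first restores the expected answer.
    assert(decomposed.normalize!NFC.all!(c => c == 'ö'));
}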

> * s.any!(c => c == 'ö') works only with autodecoding. It returns always false without.


False. (While this is pretty much the same as the first example, one can come up with as many examples as desired by tweaking the same one to produce endless variations.)

> * s.balancedParens('〈', '〉') works only with autodecoding.


Not sure, so I'll say OK.

> * s.canFind('ö') works only with autodecoding. It returns always false without.


False.

> * s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result.


False.

> * s.count('ö') works only with autodecoding. It returns always zero without.


False.

> * s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings.


False.

> * s.endsWith('ö') works only with autodecoding. It returns always false without.


False.

> * s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.find('ö') works only with autodecoding. It never finds it without.


False.

> * s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things.


Not sure, so I'll say OK, though I strongly suspect that, as with the others, this will only work if the strings are normalized.
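
For what it's worth, a quick illustrative sketch (plain Phobos) backing that suspicion up: with autodecoding, findAdjacent misses two adjacent identical graphemes when they are stored decomposed, and only finds them after normalization.

import std.algorithm.searching : findAdjacent;
import std.uni : normalize, NFC;

void main()
{
    // Two "ö" graphemes in decomposed form: 'o' + U+0308, twice.
    string s = "o\u0308o\u0308";

    // Autodecoding yields the code points [o, U+0308, o, U+0308];
    // no two adjacent ones are equal, so nothing is found.
    assert(s.findAdjacent.empty);

    // After NFC normalization the string is U+00F6 U+00F6 and the
    // adjacent pair is found.
    assert(!s.normalize!NFC.findAdjacent.empty);
}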

> * s.findAmong(s1) is also interesting. It works only with autodecoding.


False.

> * s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results.


False.

> * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. Otherwise they compile and run but produce incorrect results.


False.

> * s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers.


Not sure, so I'll say OK.

> * s.minPos, s.maxPos follow a similar semantics.


Not sure, so I'll say OK.

> * s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.


False.

> * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range.


False.

> ===
>
> The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others.
>
> Andrei

I mean, what a train wreck. Your examples say it all, don't they? Almost none of them would work without normalizing the strings first, and that is the point you've been refusing to hear so far. Autodecoding doesn't pay for itself, as it is unable to do what it is supposed to do in the general case.

Really, there is not much you can do with anything Unicode-related without first going through normalization. If you want anything more than substring search or the like, you'll also need collation, which is locale-dependent (for sorting, for instance).
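
To illustrate the sorting point (hypothetical words, plain Phobos): code point order is not linguistic order, and there is no collation facility in Phobos to express the latter.

import std.algorithm.sorting : sort;

void main()
{
    string[] words = ["zebra", "öl"];
    sort(words);

    // Code unit / code point order puts 'z' (U+007A) before
    // 'ö' (U+00F6), so "zebra" sorts first...
    assert(words == ["zebra", "öl"]);

    // ...whereas a German dictionary collation would put "öl" first.
}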

Supporting Unicode, IMO, would mean providing facilities to normalize (preferably lazily, as a range), to manage collations, and so on. Decoding to code points just doesn't cut it.

As a result, any algorithm that needs to support strings must either fight against the language because it doesn't need decoding, use decoding and accept being incorrect on non-normalized input, or do the correct thing by itself (which is also going to require working against the language).
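
(For completeness, the usual way to fight the language in the first case is std.utf.byCodeUnit, which wraps a string so the range primitives stop decoding. A tiny sketch:)

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "hëllo";

    // The wrapped range yields raw UTF-8 code units (chars),
    // not decoded dchars, so no decoding cost is paid.
    auto units = s.byCodeUnit;
    assert(units.canFind('l'));

    // 0xC3 is the lead byte of the two-byte encoding of 'ë'.
    assert(units.canFind(cast(char) 0xC3));
}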
