On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu wrote:
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.
This code snippet demonstrates the problem:
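import std.stdio;

void main()
{
    dstring a = "e\u0301"; // 'e' followed by U+0301 (combining acute accent)
    dstring b = "é";       // precomposed U+00E9 (if the source file is saved in NFC)
    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);
    writeln(a, " ", b);    // writeln, not writefln: there is no format string here
}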
If you run the above code all asserts should pass. If your system
correctly supports Unicode (works on OS X 10.10) the two printed
characters should look exactly the same.
\u0301 is the "combining acute accent" [1].
[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Isn't this solved commonly with a normalization pass? We should
have a normalizeUTF() that can be inserted in a pipeline. Then
the rest of Phobos doesn't need to mind these combining
characters. -- Andrei
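For reference, a minimal sketch of what such a pass could look like for the snippet above. The name normalizeUTF is only the hypothetical one from the suggestion; this version assumes NFC is the wanted form and just forwards to std.uni.normalize, which Phobos already provides.

```d
import std.uni;

// Hypothetical helper named after the suggestion above; it is not a Phobos
// function. It assumes NFC is the form the rest of the pipeline expects.
dstring normalizeUTF(dstring s)
{
    return normalize!NFC(s);
}

void main()
{
    dstring a = "e\u0301"; // decomposed: 'e' + combining acute accent
    dstring b = "\u00E9";  // precomposed é

    assert(a != b);                             // raw code points differ
    assert(normalizeUTF(a) == normalizeUTF(b)); // equal after the NFC pass
}
```

For this particular pair the pass is enough, because the sequence has a precomposed equivalent.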
Normalisation can sometimes allow simplifications, but knowing
whether it will requires a lot of a priori knowledge about both
the input and the normalisation form.
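A small sketch of why both matter, using only std.uni and std.range. The "q\u0301" input is my own illustrative choice: as far as I know Unicode has no precomposed q with acute, so NFC cannot collapse it.

```d
import std.range : walkLength;
import std.uni;

void main()
{
    dstring decomposed  = "e\u0301"; // 'e' + combining acute accent
    dstring precomposed = "\u00E9";  // é as a single code point

    // The form matters: NFC composes where it can, NFD does the opposite.
    assert(normalize!NFC(decomposed).length == 1);
    assert(normalize!NFD(precomposed).length == 2);

    // The input matters: with no precomposed form available, NFC still
    // leaves a base letter followed by a combining mark.
    dstring q = "q\u0301";
    assert(normalize!NFC(q).length == 2);

    // Counting graphemes gives one user-perceived character either way.
    assert(decomposed.byGrapheme.walkLength == 1);
    assert(q.byGrapheme.walkLength == 1);
}
```

So code that assumes "one code point per character" after an NFC pass is only safe if the input is known never to contain sequences like the last one.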