On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>On 2015-04-18 12:27, Walter Bright wrote:
>
>>That doesn't make sense to me, because the umlauts and the accented
>>e all have Unicode code point assignments.
>
>This code snippet demonstrates the problem:
>
>import std.stdio;
>
>void main ()
>{
>    // "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
>    dstring a = "e\u0301";
>    // precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE (as stored in this source file)
>    dstring b = "é";
>    assert(a != b);        // different code point sequences
>    assert(a.length == 2); // two code points
>    assert(b.length == 1); // one code point
>    writeln(a, " ", b);    // both render as "é"
>}
>
>If you run the above code all asserts should pass. If your system
>correctly supports Unicode (works on OS X 10.10), the two printed
>characters should look exactly the same.
>
>\u0301 is the "combining acute accent" [1].
>
>[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is,
you never know whether a text is composed or decomposed, so you have to
be prepared for "é" having length 2 or 1. On OS X these characters are
automatically decomposed by default, so if you pipe it through the
system, an "é" (length 1) automatically becomes "e\u0301" (length 2).
The same goes for file names on OS X. I've had to find a workaround for
this more than once.
Wait, I thought the recommended approach is to normalize first, then do
string processing later? Normalizing first will eliminate
inconsistencies of this sort and allow string-processing code to use a
uniform approach to handling the string. I don't think it's a good idea
to manually deal with composed/decomposed issues within every individual
string function.
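The normalize-first idea is easy to demonstrate outside of D. A minimal sketch in Python (the thread's code is D, but `unicodedata` in Python's standard library exposes the same Unicode normalization forms): after NFC normalization, the composed and decomposed spellings compare equal.

```python
import unicodedata

a = "e\u0301"   # decomposed: 'e' + COMBINING ACUTE ACCENT
b = "\u00e9"    # precomposed: LATIN SMALL LETTER E WITH ACUTE

# The raw strings differ, just as in the D snippet above.
assert a != b
assert len(a) == 2 and len(b) == 1

# Normalizing both to NFC (composed form) removes the inconsistency.
na = unicodedata.normalize("NFC", a)
nb = unicodedata.normalize("NFC", b)
assert na == nb
assert len(na) == 1
```

NFD ("normalize to decomposed") works just as well for comparison purposes; what matters is that both sides are put into the same form before any string processing happens.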
Of course, even after normalization, you still have the issue of
zero-width characters and combining diacritics, because not every
language has precomposed characters handy.
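This limit of normalization is concrete and checkable. A small Python illustration (again standing in for the D code under discussion): "e" + combining acute has a precomposed code point, so NFC collapses it, but "q" + combining acute has none, so it stays two code points even after NFC.

```python
import unicodedata

# "q" + combining acute has no precomposed code point in Unicode,
# so NFC must leave it as two code points.
s = unicodedata.normalize("NFC", "q\u0301")
assert len(s) == 2

# "e" + combining acute does have one (U+00E9), so NFC composes it.
t = unicodedata.normalize("NFC", "e\u0301")
assert len(t) == 1
```

So even fully normalized text can contain multi-code-point user-perceived characters, which is exactly why grapheme-level iteration is still needed.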
Using byGrapheme, within the current state of Phobos, is still the best
bet for correctly counting the number of printed columns as opposed to
the number of "characters" (which, in the Unicode definition, does not
always match the layman's notion of "character"). Unfortunately,
byGrapheme may allocate, which fails Walter's requirements.
Well, to be fair, byGrapheme only *occasionally* allocates -- only for
input with unusually long sequences of combining diacritics -- so for
normal use cases you'll pretty much never have any allocations. But the
language can't express the idea of "occasionally allocates"; there is
only "allocates" or "@nogc", which makes it unusable in @nogc code.
One possible solution would be to modify std.uni.graphemeStride to not
allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.
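The allocation-free idea can be sketched with a deliberately simplified model. The Python function below (hypothetical, not Phobos code) counts user-perceived characters by treating every code point with canonical combining class 0 as the start of a new cluster; real grapheme segmentation per UAX #29 handles more cases (ZWJ emoji sequences, Hangul jamo, regional indicators), but this version shows how a stride/count can be computed with no per-cluster buffer at all.

```python
import unicodedata

def grapheme_count_approx(s):
    """Approximate count of user-perceived characters.

    A new cluster starts at every code point whose canonical
    combining class is 0; combining marks (class > 0) attach to
    the preceding cluster. No allocation beyond the counter.
    """
    count = 0
    for ch in s:
        if unicodedata.combining(ch) == 0:
            count += 1
    return count

assert grapheme_count_approx("e\u0301") == 1   # decomposed é: one cluster
assert grapheme_count_approx("\u00e9") == 1    # precomposed é: one cluster
assert grapheme_count_approx("abc") == 3
```

The same approximation gives composed and decomposed input the same length, which is the property the thread is after.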
T
This is why on OS X I always normalized strings to composed form.
However, there are always issues with Unicode because, as you said, the
layman's notion of what a character is is not the same as Unicode's. I
wrote a utility function that uses byGrapheme and byCodePoint. It's a
bit of overhead, but I always get the correct length and character
access (e.g. if txt.startsWith("é")).
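The startsWith problem generalizes: a byte- or code-point-level prefix test fails whenever the haystack and needle use different normalization forms. A minimal sketch in Python (the helper name `nfc_startswith` is made up for illustration; the D utility mentioned above uses byGrapheme/byCodePoint instead):

```python
import unicodedata

def nfc_startswith(text, prefix):
    """Prefix test that is insensitive to composed vs. decomposed
    spellings: normalize both sides to NFC before comparing."""
    return unicodedata.normalize("NFC", text).startswith(
        unicodedata.normalize("NFC", prefix))

# A naive startswith fails when the forms differ...
assert not "e\u0301tude".startswith("\u00e9")
# ...but the normalized comparison succeeds.
assert nfc_startswith("e\u0301tude", "\u00e9")
```

Normalizing once at the program's input boundary, rather than inside every such helper, avoids paying the normalization cost on each call.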