On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>On 2015-04-18 12:27, Walter Bright wrote:
>
>>That doesn't make sense to me, because the umlauts and the accented
>>e all have Unicode code point assignments.
>
>This code snippet demonstrates the problem:
>
>import std.stdio;
>
>void main ()
>{
>    dstring a = "e\u0301";
>    dstring b = "é";
>    assert(a != b);
>    assert(a.length == 2);
>    assert(b.length == 1);
>    writefln("%s %s", a, b);
>}
>
>If you run the above code all asserts should pass. If your system
>correctly supports Unicode (works on OS X 10.10) the two printed
>characters should look exactly the same.
>
>\u0301 is the "combining acute accent" [1].
>
>[1] http://www.fileformat.info/info/unicode/char/0301/index.htm

Yep, this was the cause of some bugs I had in my program. The thing is, you never know whether a text is composed or decomposed, so you have to be prepared for "é" to have length 2 or 1. On OS X these characters are decomposed by default, so if you pipe an "é" (length 1) through the system it automatically becomes "e\u0301" (length 2). The same goes for file names on OS X. I've had to find a workaround for this more than once.

Wait, I thought the recommended approach is to normalize first, then do
string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual
string function.
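The normalize-first approach can be sketched with std.uni's normalization routines (NFC composes, NFD decomposes). The literals below use explicit escapes so the example doesn't depend on the normalization form of the source file itself:

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    // The same text in its two Unicode normalization forms:
    dstring decomposed = "e\u0301"; // 'e' + combining acute accent
    dstring composed   = "\u00E9";  // precomposed 'é'

    // They are different code point sequences...
    assert(decomposed != composed);

    // ...but normalizing maps one form onto the other:
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
}
```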

Of course, even after normalization, you still have the issue of
zero-width characters and combining diacritics, because not every
language has precomposed characters handy.

In the current state of Phobos, byGrapheme is still the best bet for correctly counting the number of printed columns, as opposed to the number of "characters" (which, in the Unicode definition, does not
always match the layman's notion of "character"). Unfortunately,
byGrapheme may allocate, which fails Walter's requirements.
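As a quick sketch of the difference: counting by code point and counting by grapheme disagree on decomposed input (walkLength over a string counts decoded code points):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' + combining acute accent, then 'a' + combining diaeresis:
    auto s = "e\u0301 and a\u0308";

    // Nine code points...
    assert(s.walkLength == 9);

    // ...but only seven graphemes ("é and ä" as the user sees it).
    assert(s.byGrapheme.walkLength == 7);
}
```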

Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics -- so for normal use cases you'll pretty much never see an allocation. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc", which makes byGrapheme unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.
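For context, graphemeStride only computes the length (in code units) of the grapheme starting at a given index -- a minimal usage sketch:

```d
import std.uni : graphemeStride;

void main()
{
    string s = "e\u0301x"; // 'e' + combining acute accent, then 'x'

    // Length in UTF-8 code units of the first grapheme:
    // 'e' is 1 byte, U+0301 is 2 bytes, so the grapheme spans 3 bytes.
    size_t len = graphemeStride(s, 0);
    assert(len == 3);

    // Slicing past it lands on the next grapheme:
    assert(s[len .. $] == "x");
}
```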


T

This is why on OS X I always normalized strings to the composed form. However, there are always issues with Unicode because, as you said, the layman's notion of what a character is doesn't match Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It adds a bit of overhead, but I always get the correct length and correct character access (e.g. if txt.startsWith("é")).
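A hypothetical helper along those lines (graphemeStartsWith is my name for it, not the poster's actual utility) could normalize both sides and then compare grapheme by grapheme, so the test is insensitive to composed vs. decomposed input:

```d
import std.algorithm.searching : startsWith;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: grapheme-aware prefix test that works
/// whether either string is composed or decomposed.
bool graphemeStartsWith(string text, string prefix)
{
    return text.normalize!NFC.byGrapheme
               .startsWith(prefix.normalize!NFC.byGrapheme);
}

void main()
{
    // Decomposed haystack, precomposed needle -- still matches:
    assert(graphemeStartsWith("e\u0301tude", "\u00E9"));
    // Plain 'e' is not 'é':
    assert(!graphemeStartsWith("etude", "\u00E9"));
}
```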
