On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>On 2015-04-18 12:27, Walter Bright wrote:
>
>>That doesn't make sense to me, because the umlauts and the accented
>>e all have Unicode code point assignments.
>
>This code snippet demonstrates the problem:
>
>import std.stdio;
>
>void main ()
>{
>    dstring a = "e\u0301";
>    dstring b = "é";
>    assert(a != b);
>    assert(a.length == 2);
>    assert(b.length == 1);
>    writefln("%s %s", a, b);
>}
>
>If you run the above code all asserts should pass. If your system
>correctly supports Unicode (works on OS X 10.10) the two printed
>characters should look exactly the same.
>
>\u0301 is the "combining acute accent" [1].
>
>[1] http://www.fileformat.info/info/unicode/char/0301/index.htm

Yep, this was the cause of some bugs I had in my program. The thing is, you never know whether a text is composed or decomposed, so you have to be prepared for "é" to have length 2 or 1. On OS X these characters are decomposed by default, so if you pipe an "é" (length 1) through the system it automatically becomes "e\u0301" (length 2). The same goes for file names on OS X. I've had to find a workaround for this more than once.

Wait, I thought the recommended approach is to normalize first, then do
string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual
string function.
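The normalize-first approach can be sketched with std.uni's normalization routines (NFC composes, NFD decomposes). The literals below use explicit escapes so the example doesn't depend on the normalization form of the source file itself:

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    // The same text in its two Unicode normalization forms:
    dstring decomposed = "e\u0301"; // 'e' + combining acute accent
    dstring composed   = "\u00E9";  // precomposed 'é'

    // They are different code point sequences...
    assert(decomposed != composed);

    // ...but normalizing maps one form onto the other:
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
}
```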

Of course, even after normalization, you still have the issue of
zero-width characters and combining diacritics, because not every
language has precomposed characters handy.

In the current state of Phobos, byGrapheme is still the best bet for correctly counting the number of printed columns, as opposed to the number of "characters" (which, in the Unicode definition, does not
always match the layman's notion of "character"). Unfortunately,
byGrapheme may allocate, which fails Walter's requirements.
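As a quick sketch of the difference: counting by code point and counting by grapheme disagree on decomposed input (walkLength over a string counts decoded code points):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' + combining acute accent, then 'a' + combining diaeresis:
    auto s = "e\u0301 and a\u0308";

    // Nine code points...
    assert(s.walkLength == 9);

    // ...but only seven graphemes ("é and ä" as the user sees it).
    assert(s.byGrapheme.walkLength == 7);
}
```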

Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics -- so for normal use cases you'll pretty much never see an allocation. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc", which makes byGrapheme unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.
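For context, graphemeStride only computes the length (in code units) of the grapheme starting at a given index -- a minimal usage sketch:

```d
import std.uni : graphemeStride;

void main()
{
    string s = "e\u0301x"; // 'e' + combining acute accent, then 'x'

    // Length in UTF-8 code units of the first grapheme:
    // 'e' is 1 byte, U+0301 is 2 bytes, so the grapheme spans 3 bytes.
    size_t len = graphemeStride(s, 0);
    assert(len == 3);

    // Slicing past it lands on the next grapheme:
    assert(s[len .. $] == "x");
}
```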


T

This is why on OS X I always normalized strings to the composed form. However, there are always issues with Unicode because, as you said, the layman's notion of what a character is doesn't match Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It adds a bit of overhead, but I always get the correct length and correct character access (e.g. if txt.startsWith("é")).
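A hypothetical helper along those lines (graphemeStartsWith is my name for it, not the poster's actual utility) could normalize both sides and then compare grapheme by grapheme, so the test is insensitive to composed vs. decomposed input:

```d
import std.algorithm.searching : startsWith;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: grapheme-aware prefix test that works
/// whether either string is composed or decomposed.
bool graphemeStartsWith(string text, string prefix)
{
    return text.normalize!NFC.byGrapheme
               .startsWith(prefix.normalize!NFC.byGrapheme);
}

void main()
{
    // Decomposed haystack, precomposed needle -- still matches:
    assert(graphemeStartsWith("e\u0301tude", "\u00E9"));
    // Plain 'e' is not 'é':
    assert(!graphemeStartsWith("etude", "\u00E9"));
}
```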
