Re: Unicode Normalization (and graphemes and locales)

Steven Schveighoffer via Digitalmars-d Fri, 03 Jun 2016 05:26:36 -0700

On 6/3/16 8:06 AM, Jonathan M Davis via Digitalmars-d wrote:

On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:

But consider the case where you are searching the string "cassé" for the
letter 'e'. If é is encoded as 'e' + U+0301, then you will
succeed when you should fail! However, it may be that you actually want
to find specifically any code points with 'e', including ones with
combining characters. This is why we really need more discretion from
Phobos, and less hand-holding.
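To make the "cassé" case concrete, here is a minimal sketch. It is written in Python rather than D so that it needs only the standard `unicodedata` module; the variable names are made up for illustration. Python's `in` on `str` compares code points, which is the behaviour being discussed.

```python
import unicodedata

# "cassé" in two canonically equivalent encodings.
precomposed = unicodedata.normalize("NFC", "casse\u0301")  # é is U+00E9
decomposed = unicodedata.normalize("NFD", "cass\u00e9")    # é is 'e' + U+0301

# Python's `in` on str compares code points, with no grapheme awareness.
found_in_precomposed = "e" in precomposed
found_in_decomposed = "e" in decomposed

print(found_in_precomposed)  # False: there is no bare 'e' code point
print(found_in_decomposed)   # True: matches the 'e' underneath the accent
```

The two strings render identically, yet the same code-point search gives different answers, so the result depends on how the text happens to be encoded.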

There are certainly searches that will be correct. For example,
searching for newline should always work in code-point space. Actually,
what happens when you use a combining character on newline? Is it an
invalid unicode sequence? Does it matter? :)

A nice function to determine whether code points or graphemes are
required for comparison given a needle may be useful.


Well, if you know that you're dealing with a grapheme that has that problem,
you can just iterate by graphemes rather than code units like find would
normally.

Yes, I agree. This is exactly the point. Don't assume anything, just treat a type as it is written. And tell the user this!

If you are going to search a range of code points with a code point, you may not get what you expect. If you want to do a grapheme-aware search, change it to a range of graphemes, and do a grapheme search.
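A rough sketch of what "change it to a range of graphemes" means, again in Python with only the standard library. In D the counterpart would be `std.uni.byGrapheme`. The helpers `approx_graphemes` and `grapheme_find` are hypothetical names, and the clustering is a deliberate simplification: it only attaches combining marks to the preceding code point, whereas full grapheme segmentation (UAX #29) has more rules.

```python
import unicodedata

def approx_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    A simplification: full grapheme segmentation (UAX #29) has more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters

def grapheme_find(haystack, needle):
    """Return the index of the first cluster equal to needle, or -1."""
    for i, cluster in enumerate(approx_graphemes(haystack)):
        if cluster == needle:
            return i
    return -1

decomposed = "casse\u0301"
print(approx_graphemes(decomposed))          # 5 clusters; the last is 'e' + U+0301
print(grapheme_find(decomposed, "e"))        # -1
print(grapheme_find(decomposed, "e\u0301"))  # 4
```

Note that the needle here has to be in the same form as the haystack: a precomposed U+00E9 would not compare equal to 'e' + U+0301 unless both sides are normalized first.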

What I was trying to say with my example is that searching by code points, even for graphemes that definitively fit into one code point, may still not be correct in all use cases.


-Steve

