On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas <simen.kja...@gmail.com> wrote:

> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer <schvei...@yahoo.com> wrote:
>
>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshou...@digitalmars.com> wrote:
>>
>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>
>>>> Which operations do you believe would be less efficient?
>>>
>>> All of the ones that don't require decoding, such as searching, would
>>> be less efficient if decoding was done.
>>
>> Searching that does not do decoding is fundamentally incorrect. That
>> is, if you want to find a substring in a string, you cannot just
>> compare chars.
> Assuming both strings are valid UTF-8, you can. Continuation bytes can
> never be confused with the first byte of a code point, and the first
> byte always identifies how many continuation bytes there should be.
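That much is true as far as it goes: UTF-8 is self-synchronizing, because continuation bytes (0x80-0xBF) and the bytes that can start a code point occupy disjoint ranges. So a byte-wise search for an exact byte sequence can never match starting in the middle of a code point. A minimal D sketch of that safe case (hypothetical strings, using std.string.indexOf):

import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // "fiancée" with a precomposed é (U+00E9, UTF-8 bytes 0xC3 0xA9)
    string haystack = "fianc\u00E9e";

    // Byte-wise search for the exact UTF-8 bytes of "é". Because a
    // continuation byte can never look like the start of a code point,
    // this match can only begin at a real code point boundary.
    writeln(haystack.indexOf("\u00E9")); // 5 -- the code-unit (byte) offset of é
}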
But as others have pointed out to me in the past (and I once thought as you
do), the same character can be encoded in *different ways*. Strings must be
normalized to compare accurately.
Plus, a combining mark (such as an umlaut or accent) is part of a
character, but may be a separate code point. If that falls on the last
character of a word such as fiancé, then a byte-wise search for fiance
will match even though the words differ! Or, if fiancé uses a precomposed
é, it won't match at all. So depending on which valid representation of
the word you get, the search either matches or it doesn't. It's just a
complete mess without proper Unicode decoding.
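A minimal D sketch of exactly this failure mode (std.uni.normalize is in
today's Phobos; the normalization routines postdate this thread):

import std.algorithm : canFind;
import std.stdio : writeln;
import std.uni;

void main()
{
    // Two valid spellings of the same word:
    string precomposed = "fianc\u00E9";  // é as one code point (U+00E9)
    string combining   = "fiance\u0301"; // e + combining acute accent (U+0301)

    // The raw bytes differ, so plain comparison calls them unequal:
    writeln(precomposed == combining);      // false

    // Searching without normalization finds "fiance" in the combining
    // form (a false positive -- the accent is silently ignored)...
    writeln(combining.canFind("fiance"));   // true

    // ...but not in the precomposed form:
    writeln(precomposed.canFind("fiance")); // false

    // After NFC normalization, both spellings compare equal:
    writeln(normalize!NFC(precomposed) == normalize!NFC(combining)); // true
}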
-Steve