On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
> But consider the case where you are searching the string: "cassé"
>
> for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
> succeed when you should fail! However, it may be that you actually want
> to find specifically any code points with 'e', including ones with
> combining characters. This is why we really need more discretion from
> Phobos, and less hand-holding.
>
> There are certainly searches that will be correct. For example,
> searching for newline should always work in code-point space. Actually,
> what happens when you use a combining character on newline? Is it an
> invalid unicode sequence? Does it matter? :)
>
> A nice function to determine whether code points or graphemes are
> required for comparison given a needle may be useful.

Well, if you know that you're dealing with a grapheme that has that problem, you can just iterate by graphemes rather than code units like find would normally. Otherwise, what you probably end up doing is searching for the needle, verifying that the resultant range starts with the right grapheme and not just the right code point, and then calling find again to search further into the range if it was just the right code point.

Regardless, I don't see how find is really going to solve this for you unless it either assumes that you want to deal with graphemes and converts everything to graphemes, or it assumes that you want graphemes, converts to graphemes when it finds a possible match, and then only considers it a match if it's a match at the grapheme level. The latter wouldn't be expensive in most cases, but it _would_ be assuming that you want to operate on graphemes even though you have a range of code units or code points, and that's not necessarily the case. You might actually want to find the code units or code points in question and not care about graphemes (much as that's not likely to be typical). That could still be acceptable if we decided that you needed to use a range of ubyte/ushort/uint rather than a range of char/wchar/dchar in the case where you actually want to look for code units or code points rather than searching for a grapheme within a range of code units or code points.

But even if we don't take graphemes into account at all with a function like find, encoding the needle and searching with code units shouldn't be a problem. It's just that the programmer needs to be aware that they might end up finding only a partial grapheme if they're not careful. The alternative is to not allow searching for needles of one character type inside a haystack of another character type and force the programmer to do the encoding rather than having find do it.
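To make the "search, then verify" approach concrete, here is a rough sketch. It is not something Phobos provides: findGrapheme is a hypothetical helper name, it only handles a needle that is a single code point, and it assumes the haystack is valid UTF-8. It encodes the needle once, searches at the code unit level, and accepts a hit only if the grapheme starting there consists of nothing but the needle.

```d
import std.algorithm.searching : find;
import std.string : representation;
import std.uni : byGrapheme;
import std.utf : encode;

/// Hypothetical helper (not part of Phobos): searches for `needle` at the
/// code unit level, then accepts a hit only if it is a complete grapheme.
/// Returns the rest of the haystack starting at the match, or null.
string findGrapheme(string haystack, dchar needle)
{
    // Encode the needle once as UTF-8 so the search compares raw code units.
    char[4] buf;
    immutable len = encode(buf, needle);
    auto units = buf[0 .. len].representation;

    auto rest = haystack;
    while (true)
    {
        auto hit = rest.representation.find(units);
        if (hit.length == 0)
            return null;                      // no candidate left

        // Re-slice the string at the position of the code unit match.
        rest = rest[$ - hit.length .. $];

        // A grapheme with one code point means no combining marks follow.
        if (rest.byGrapheme.front.length == 1)
            return rest;

        // Only a partial grapheme matched: skip it and search further.
        rest = rest[len .. $];
    }
}

unittest
{
    string decomposed = "casse\u0301";        // 'e' followed by U+0301

    // The plain code unit search reports a hit on the bare 'e'...
    assert(decomposed.representation.find("e".representation).length != 0);
    // ...but that hit is only part of a grapheme, so the helper rejects it.
    assert(findGrapheme(decomposed, 'e') is null);
    // A standalone 'e' later in the string is still found.
    assert(findGrapheme("casse\u0301 et", 'e') == "et");
}
```

The unittest block can be run with something like `dmd -unittest -main -run file.d`. The grapheme check happens only on candidate matches, which is why this shouldn't be expensive in most cases.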
Forcing the programmer to do the encoding wouldn't be the end of the world, but it wouldn't be as user-friendly, and I'm not sure that it would be a great idea given that we currently can do those comparisons thanks to auto-decoding, and we'd effectively be losing functionality if it didn't work with other ranges of characters (or with strings if/once auto-decoding is killed off).

Ultimately, we need to make sure that we don't prevent the programmer from handling Unicode correctly or make it more difficult in an attempt to make it easier for the programmer (which is essentially what auto-decoding does), but that doesn't mean that there aren't cases where we can bake some Unicode handling into functions to increase efficiency without losing out on correctness. And making find encode the needle so that it can compare at the code unit level doesn't lose out on correctness. It just isn't sufficient for full correctness on its own.

- Jonathan M Davis
