On Thu, 09 Jan 2014 15:51:36 -0500, Jerry <[email protected]> wrote:
> Marco Leise <[email protected]> writes:
>
> > On Thu, 09 Jan 2014 15:20:13 +0000,
> > "John Colvin" <[email protected]> wrote:
> >
> > The point about graphemes is good. D's functions still stop
> > mid-way. From UTF-8 you can iterate UTF-32 code points, but
> > grapheme clusters are the new characters. I.e. the basic need
> > to iterate Unicode _characters_ is not supported!
> > I cannot even come up with use cases for working with code
> > points and think they are a conceptual black hole. Something
> > carried over from a time when grapheme clusters didn't exist.
>
> Actually, you can do tons of NLP without grapheme clusters. If you're
> paranoid, you standardize on a specific Unicode normalization first.
>
> You can probably get a bit better results by paying attention to
> clusters, but I suspect it will be a marginal improvement.
>
> That said, I do agree with the OP that the string API is currently more
> complex to understand than I'd like. However, it's significantly easier
> to use than what's in standard C++ for anything beyond ascii.
>
> Jerry

Sorry, I got confused with the Unicode definitions. I see now
that a grapheme cluster is e.g. \r\n. What I really meant is
that Phobos needs to support graphemes. But seeing that
monsters like this exist: n͠g, I don't even know if this is one
character or two, but right now Phobos sees it as three
characters.

-- 
Marco
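
To make the three ways of counting concrete, here is a minimal D sketch. It assumes a Phobos version that ships std.uni.byGrapheme, and it assumes the mark in the example string above is U+0360 (COMBINING DOUBLE TILDE); both are assumptions, not something stated in the thread.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // Assumed spelling of the example: 'n', U+0360, 'g'.
    string s = "n\u0360g";

    // .length on a string counts UTF-8 code units (bytes):
    // 1 for 'n', 2 for U+0360, 1 for 'g'.
    writeln("code units:  ", s.length);

    // Range functions decode a string to code points (dchar),
    // which is the "three characters" view.
    writeln("code points: ", s.walkLength);

    // byGrapheme groups a base character with its combining marks,
    // so 'n' + U+0360 form one cluster and 'g' forms the other.
    writeln("graphemes:   ", s.byGrapheme.walkLength);

    // The "\r\n" case: two code points, iterated as one cluster.
    writeln("CR LF:       ", "\r\n".byGrapheme.walkLength);
}
```

Under those assumptions the string is 4 code units, 3 code points and 2 grapheme clusters, so by the grapheme rules the example is two user-perceived characters rather than one or three.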
