On Sat, Dec 06, 2014 at 10:37:17PM +0000, "Nordlöw" via Digitalmars-d-learn wrote:
> Given the fact that
>
>     static assert("é".length == 2);
>
> I was surprised that
>
>     static assert("é".byCodeUnit.length == 2);
>     static assert("é".byCodePoint.length == 2);
>
> Isn't there a way to iterate over accented characters (in my case
> UTF-8) in D? Or is this an inherent problem in Unicode? I need this in
> a syllable counting algorithm that needs to distinguish accented and
> non-accented variants of vowels. For example café (2 syllables)
> compared to babe (one syllable.

This is a Unicode issue. What you want is neither byCodeUnit nor
byCodePoint, but byGrapheme (in std.uni). A grapheme is the Unicode
equivalent of what lay people would call a "character". A Unicode
character (or more precisely, a "code point") is not necessarily a
complete grapheme, as your example above shows: an accented letter can
be encoded as a base letter followed by a combining mark. A code point
is just a numerical value that uniquely identifies an entry in the
Unicode character database.


T

--
There are 10 kinds of people in the world: those who can count in binary, and those who can't.
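
To make the three levels concrete, here is a minimal sketch in D. It assumes the literals are spelled with explicit \u escapes so that the precomposed and decomposed forms of "é" cannot be confused; graphemeCount is a hypothetical helper name, not something from Phobos. It only illustrates counting, not a full syllable counter.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts user-perceived characters (graphemes)
/// in a UTF-8 string by walking the lazy range from std.uni.byGrapheme.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // Precomposed form: U+00E9 is one code point, two UTF-8 code units.
    string precomposed = "caf\u00E9";
    // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "cafe\u0301";

    assert(precomposed.length == 5);          // UTF-8 code units
    assert(precomposed.walkLength == 4);      // code points (strings decode to dchar)
    assert(graphemeCount(precomposed) == 4);  // graphemes

    assert(decomposed.length == 6);           // UTF-8 code units
    assert(decomposed.walkLength == 5);       // code points
    assert(graphemeCount(decomposed) == 4);   // graphemes: same count as above
}
```

Both spellings of "café" come out as four graphemes, even though their code unit and code point counts differ. For telling accented and unaccented vowels apart, one possible approach is to compare whole graphemes rather than individual code points.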