On Thu, 26 May 2016 16:23:16 -0700, "H. S. Teoh via Digitalmars-d" <[email protected]> wrote:

> On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via
> Digitalmars-d wrote:
> [...]
> > s.walkLength
> > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > s.count!(c => c >= 32) // non-control characters
>
> Question: what should count return, given a string containing (1)
> combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
>
>
> > Currently the standard library operates at code point level even
> > though inside it may choose to use code units when admissible. Leaving
> > such a decision to the library seems like a wise thing to do.
>
> The problem is that often such decisions can only be made by the user,
> because it depends on what the user wants to accomplish. What should
> count return, given some Unicode string? If the user wants to determine
> the size of a buffer (e.g., to store a string minus some characters to
> be stripped), then count should return the byte count. If the user wants
> to count the number of matching visual characters, then count should
> return the number of graphemes. If the user wants to determine the
> visual width of the (filtered) string, then count should not be used at
> all, but instead a font metric algorithm. (I can't think of a practical
> use case where you'd actually need to count code points(!).)

Hey, I was about to answer exactly the same. It reminds me that a few
years ago I proposed making string iteration in Rust explicit by code
unit, code point, or grapheme. There was virtually no debate about doing
it, on the grounds that to write correct code people need to understand
a bit of Unicode and pick the right primitive. If you don't know which
one to pick, you look it up.

-- 
Marco
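
To make the three levels concrete, here is a minimal D sketch of what "pick the right primitive" could look like. It assumes Phobos' `std.range.walkLength` and `std.uni.byGrapheme`; the helper names `codeUnits`, `codePoints` and `graphemes` are made up for illustration and are not part of any library.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes): what you need for buffer sizes.
size_t codeUnits(string s) { return s.length; }

/// Number of code points: a string iterated as a range is decoded to dchar.
size_t codePoints(string s) { return s.walkLength; }

/// Number of graphemes: the closest match to "visual characters".
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visual character.
    string s = "e\u0301";
    assert(codeUnits(s) == 3);  // 1 byte for 'e' + 2 bytes for U+0301
    assert(codePoints(s) == 2); // 'e' and the combining accent
    assert(graphemes(s) == 1);  // rendered as a single character
}
```

The same string gives three different answers, so the caller has to say which one is meant.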
