"H. S. Teoh" <[email protected]> wrote in message
news:[email protected]...
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller" <[email protected]> wrote in message
>> news:[email protected]...
>> > I'm writing an introduction/tutorial to using strings in D, paying
>> > particular attention to the complexities of UTF-8 and 16. I realised
>> > that when you want the number of characters, you normally actually
>> > want to use walkLength, not length. Is it reasonable for the
>> > compiler to pick this up during semantic analysis and point out this
>> > situation?
>> >
>> > It's just a thought because a lot of the time, using length will get
>> > the right answer, but for the wrong reasons, resulting in lurking
>> > bugs. You can always cast to immutable(ubyte)[] or
>> > immutable(short)[] if you want to work with the actual bytes anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>

Interesting. Kinda makes sense that such a thing exists, though: The CJK
characters (even the relatively simple Japanese *kana) are detailed
enough that they need to be larger to achieve the same readability. And
that's the *non*-double-width ones. So I don't doubt there are ones that
need to be tagged as "Draw Extra Big!!" :)

For example, I have my font size in Windows Notepad set to a comfortable
value. But when I want to use hiragana or katakana, I have to go into
the settings and increase the font size so I can actually read it (Well,
to what *little* extent I can even read it in the first place ;) ). And
those kana tend to be among the simplest CJK characters. (Don't worry -
I only use Notepad as a quick-n-dirty scrap space, never for real
coding/writing).

> So we really need all four lengths. Ain't unicode fun?! :-)
>

No kidding. The *one* thing I really, really hate about Unicode is the
fact that most (if not all) of its complexity actually *is* necessary.

Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

> Array length is simple. Walklength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
>

Yup.

> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?
>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs), codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>

While I find that very interesting...I'm afraid I don't actually
understand your suggestion :/ (I do understand FSMs and how they work,
though) Could you give a little example of what you mean?
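
One possible reading of the quoted proposal, as a minimal hand-written
sketch rather than anything generated from the Unicode data files: a
small state machine that walks the UTF-8 code units directly and counts
code points while skipping combining marks, without ever decoding to
dchar. The function name approxGraphemeCount is made up for
illustration. It assumes valid UTF-8 and only knows about the Combining
Diacritical Marks block (U+0300..U+036F), so it is nowhere near a real
grapheme counter. Whether this matches what the proposal actually had in
mind is a guess.

```d
/// Toy state machine over UTF-8 code units (hypothetical helper).
/// Counts code points, skipping combining marks in U+0300..U+036F only.
/// Assumes the input is valid UTF-8.
size_t approxGraphemeCount(const(char)[] s) pure nothrow @nogc @safe
{
    enum State { start, sawCC, sawCD }
    auto state = State.start;
    size_t count = 0;

    foreach (char c; s) // iterates code units; nothing is decoded
    {
        final switch (state)
        {
        case State.start:
            if (c == 0xCC)
                state = State.sawCC;
            else if (c == 0xCD)
                state = State.sawCD;
            else if ((c & 0xC0) != 0x80)
                ++count; // ASCII or any other lead byte starts a code point
            break;

        case State.sawCC:
            // 0xCC 0x80..0xBF encodes U+0300..U+033F: all combining marks
            state = State.start;
            break;

        case State.sawCD:
            // 0xCD 0x80..0xAF encodes U+0340..U+036F (combining marks);
            // 0xCD 0xB0..0xBF encodes U+0370..U+037F, which are counted
            if (c > 0xAF)
                ++count;
            state = State.start;
            break;
        }
    }
    return count;
}

void main()
{
    import std.stdio : writeln;

    // 'e' + U+0301 COMBINING ACUTE ACCENT + 'x'
    writeln(approxGraphemeCount("e\u0301x")); // 2
    writeln(approxGraphemeCount("abc"));      // 3
}
```

The states here were worked out by hand from the UTF-8 byte patterns of
a single block. The quoted idea, as far as it can be read, is that a tool
would derive the equivalent states from the full property tables once per
Unicode release.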

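
For reference, three of the four lengths discussed in this thread can be
compared directly in D. This is a minimal sketch assuming a current
Phobos that provides std.uni.byGrapheme (which, as far as I know, was
added after this thread was written). The three wrapper function names
are made up for illustration. Layout width is left out, since I'm not
aware of a Phobos function for it.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes): what .length reports for a string.
size_t codeUnitLength(string s)
{
    return s.length;
}

/// Number of code points: a string iterated as a range yields dchar.
size_t codePointLength(string s)
{
    return s.walkLength;
}

/// Number of graphemes (user-perceived characters).
size_t graphemeLength(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    import std.stdio : writeln;

    // 'e' + U+0301 COMBINING ACUTE ACCENT + 'x'
    string s = "e\u0301x";
    writeln(codeUnitLength(s));  // 4
    writeln(codePointLength(s)); // 3
    writeln(graphemeLength(s));  // 2
}
```

For pure ASCII input all three agree, which is why using length "gets
the right answer, but for the wrong reasons" so often.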