On 27.04.2012 1:23, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller"<[email protected]> wrote in message
>> news:[email protected]...
>>> I'm writing an introduction/tutorial to using strings in D, paying
>>> particular attention to the complexities of UTF-8 and 16. I realised
>>> that when you want the number of characters, you normally actually
>>> want to use walkLength, not length. Is it reasonable for the
>>> compiler to pick this up during semantic analysis and point out this
>>> situation?
>>> It's just a thought, because a lot of the time using length will get
>>> the right answer, but for the wrong reasons, resulting in lurking
>>> bugs. You can always cast to immutable(ubyte)[] or
>>> immutable(ushort)[] if you want to work with the actual bytes anyway.
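A minimal sketch of that cast, assuming a current D compiler and std.range at hand:

```d
import std.range : walkLength;

void main() {
    string s = "héllo";                      // 'é' occupies 2 bytes in UTF-8
    auto bytes = cast(immutable(ubyte)[]) s; // reinterpret the raw code units
    assert(s.length == 6);                   // UTF-8 code units, not characters
    assert(s.walkLength == 5);               // decoded code points
    assert(bytes.length == 6);               // same count, but honestly typed
}
```

The cast does not copy anything; it only changes the static type, so length on the ubyte slice makes the "I mean bytes" intent explicit.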
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (i.e., graphemes), but merely the
>> number of code points - which is not the same thing (due to the
>> existence of the [confusingly-named] "combining characters").
> [...]
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet another length (layoutLength?).
> So we really need all four lengths. Ain't unicode fun?! :-)
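To make the first three lengths concrete, a sketch using today's std.range and std.uni (byGrapheme is assumed available, as in recent Phobos):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main() {
    string s = "e\u0301";                 // 'e' + U+0301 combining acute
    assert(s.length == 3);                // UTF-8 code units (U+0301 takes 2)
    assert(s.walkLength == 2);            // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes: one visible character
}
```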
Array length is simple. walkLength is already implemented. Grapheme
length requires recognizing 'combining characters' (or rather,
ignoring said characters), and layout length requires recognizing
widthless, single- and double-width characters.
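A rough sketch of that layout length, with a hypothetical isDoubleWidth helper covering only a few well-known wide ranges (a real table would be generated from Unicode's EastAsianWidth.txt), and std.uni.combiningClass used to skip widthless marks:

```d
import std.uni : combiningClass;

// Hypothetical helper: a handful of well-known double-width ranges only.
// A real implementation would be generated from EastAsianWidth.txt.
bool isDoubleWidth(dchar c) {
    return (c >= 0x1100 && c <= 0x115F)   // Hangul Jamo
        || (c >= 0x2E80 && c <= 0xA4CF)   // CJK radicals .. Yi syllables
        || (c >= 0xAC00 && c <= 0xD7A3)   // Hangul syllables
        || (c >= 0xF900 && c <= 0xFAFF)   // CJK compatibility ideographs
        || (c >= 0xFF00 && c <= 0xFF60)   // fullwidth forms
        || (c >= 0xFFE0 && c <= 0xFFE6);
}

size_t layoutLength(string s) {
    size_t width = 0;
    foreach (dchar c; s) {                // foreach decodes UTF-8 to dchar
        if (combiningClass(c) != 0)
            continue;                     // combining marks take no column
        width += isDoubleWidth(c) ? 2 : 1;
    }
    return width;
}

void main() {
    assert(layoutLength("abc") == 3);
    assert(layoutLength("日本") == 4);    // two double-width ideographs
    assert(layoutLength("e\u0301") == 1); // base letter + combining accent
}
```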
> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?
Of course they are generated.
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs),
FSAs are based on tables, so it all runs in a circle - only the layout
changes. Yet the speed gains from not decoding are huge.
> codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
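The simplest instance of working at the narrow-string level is code-point counting, which needs no decoding at all: in well-formed UTF-8, every code point contributes exactly one byte that is not a continuation byte (0b10xxxxxx). A sketch:

```d
import std.range : walkLength;

// Count code points in a UTF-8 string without decoding:
// only continuation bytes match the bit pattern 0b10xxxxxx.
size_t codePointCount(string s) {
    size_t n = 0;
    foreach (ubyte b; cast(immutable(ubyte)[]) s)
        if ((b & 0xC0) != 0x80)
            ++n;
    return n;
}

void main() {
    assert(codePointCount("hello") == 5);
    assert(codePointCount("héllo") == 5); // agrees with decoding walkLength
    assert(codePointCount("héllo") == "héllo".walkLength);
}
```

This is a single linear pass over bytes with no branching on sequence length, which is exactly the kind of win non-decoding table/FSM approaches generalize.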
This year Unicode in D will receive a nice upgrade.
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#
Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)
--
Dmitry Olshansky