28-Sep-2014 23:44, Uranuz wrote:
I totally agree with all of that.
It's one of those cases where correct by default is far too slow (that
would have to be graphemes) but fast by default is far too broken.
Better to force an explicit choice.
There is no magic bullet for Unicode in a systems language such as D.
The programmer must be aware of it and make choices about how to treat
it.
I see. I didn't know about the difference between byCodeUnit and
byGrapheme, because I speak Russian, and it is close to English in
that it doesn't have diacritics. As far as I remember, German,
which I learned at school, has diacritics. So you opened my eyes
on this question. My position as an ordinary programmer is that I
speak a language whose graphemes are encoded in 2 bytes
In UTF-16 and UTF-8.
and I always
need to do decoding, otherwise my program will be broken. Another
possibility is to use wstring or dstring, but that is less memory
efficient. Also, UTF-8 is more commonly used on the Internet, so I
don't want to do conversions to UTF-32, for example.
Where can I read about byGrapheme?
std.uni docs:
http://dlang.org/phobos/std_uni.html#.byGrapheme
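
To make the distinction concrete, here is a minimal sketch (not taken from the
docs) that counts the same string three ways. The helper names codeUnits,
codePoints and graphemes are made up for illustration; byCodeUnit, byDchar,
byGrapheme and walkLength are the Phobos functions. It assumes a Phobos
version that ships std.utf.byCodeUnit.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helpers, named only for this illustration.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; } // UTF-8 bytes
size_t codePoints(string s) { return s.byDchar.walkLength; }    // decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // grapheme clusters

void main()
{
    // Cyrillic 'и' (U+0438) followed by a combining breve (U+0306):
    // rendered as a single letter, stored as two code points.
    string s = "и\u0306";
    assert(codeUnits(s) == 4);  // each of the two code points takes 2 bytes
    assert(codePoints(s) == 2);
    assert(graphemes(s) == 1);
}
```

Which of the three counts is "the length" depends on what the program needs,
which is why the choice has to be explicit.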
Isn't this approach
overcomplicated? I don't want to write Dostoevskiy's book "War
and Peace" in order to write a parser for a simple DSL.
It's Tolstoy actually:
http://en.wikipedia.org/wiki/War_and_Peace
You don't need byGrapheme for a simple DSL. In fact, as long as the DSL
is simple enough (ASCII only), you may safely avoid decoding. If it's in
Russian you might want to decode. Even in that case there are ways to
avoid decoding, though it may take about as much writing as a typical
short novel ;)
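
As a rough sketch of the "avoid decoding" case: if the characters the DSL
treats specially are all ASCII, the input can be scanned byte by byte. In
UTF-8 every byte of a multi-byte sequence is 0x80 or above, so an ASCII byte
never occurs inside one, and slicing at ASCII positions is safe. The function
splitOnSpaces below is a hypothetical helper, not something from Phobos.

```d
import std.string : representation;

/// Hypothetical helper: splits on ASCII spaces without decoding.
string[] splitOnSpaces(string src)
{
    string[] tokens;
    size_t start = 0;
    auto bytes = src.representation; // immutable(ubyte)[], no decoding involved
    foreach (i, b; bytes)
    {
        if (b == ' ')
        {
            if (i > start)
                tokens ~= src[start .. i]; // slice boundary is at an ASCII byte
            start = i + 1;
        }
    }
    if (start < src.length)
        tokens ~= src[start .. $];
    return tokens;
}

void main()
{
    // Non-ASCII text passes through untouched as opaque bytes.
    auto t = splitOnSpaces("let имя = 42");
    assert(t == ["let", "имя", "=", "42"]);
}
```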
In fact I did a couple of such literary exercises in the standard library.
For codepoint lookups on non-decoded strings:
http://dlang.org/phobos/std_uni.html#.utfMatcher
And to create sets of codepoints to detect with matcher:
http://dlang.org/phobos/std_uni.html#.CodepointSet
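
A small sketch of how the two fit together, modelled on the examples in the
std.uni docs. The helper leadingLetters and the particular set of intervals
are made up for illustration, and I assume the match method behaves as
documented (it advances the input only on a successful match).

```d
import std.uni : CodepointSet, utfMatcher;

/// Hypothetical helper: byte length of the leading run of letters in s.
size_t leadingLetters(string s)
{
    // Intervals are half-open, so this covers a..z and Cyrillic а..я.
    // Built on every call for brevity; real code would build it once.
    auto letters = CodepointSet('a', 'z' + 1, 'а', 'я' + 1);
    auto m = utfMatcher!char(letters);

    auto rest = s;
    // match tests the code point at the front of the UTF-8 input
    // and skips past it when it belongs to the set.
    while (rest.length && m.match(rest)) {}
    return s.length - rest.length;
}

void main()
{
    assert(leadingLetters("имя = 1") == 6); // three 2-byte letters
    assert(leadingLetters("abc1") == 3);
    assert(leadingLetters("1abc") == 0);
}
```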
--
Dmitry Olshansky