28-Sep-2014 23:44, Uranuz wrote:
I totally agree with all of that.

It's one of those cases where correct by default is far too slow (that
would have to be graphemes) but fast by default is far too broken.
Better to force an explicit choice.

There is no magic bullet for Unicode in a systems language such as D.
The programmer must be aware of it and make choices about how to treat
it.

I see. I didn't know about the difference between byCodeUnit and
byGrapheme, because I speak Russian, which is close to English in
that it doesn't have diacritics. As far as I remember, German,
which I learned at school, does have diacritics. So you opened my
eyes on this question. My position as an ordinary programmer is
that I speak a language whose graphemes are encoded in 2 bytes

In UTF-16 and UTF-8.

and I always
need to decode, otherwise my program will be broken. Another
possibility is to use wstring or dstring, but that is less memory
efficient. Also, UTF-8 is more commonly used on the Internet, so I
don't want to convert to UTF-32, for example.

Where I could read about byGrapheme?

std.uni docs:
http://dlang.org/phobos/std_uni.html#.byGrapheme
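
To make the difference concrete, here is a minimal sketch. It assumes a Phobos version that ships std.utf.byCodeUnit, and the test string is my own illustration: the letter й written in decomposed form, as и (U+0438) followed by a combining breve (U+0306).

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // Illustrative input: "й" spelled as base letter + combining mark.
    string s = "и\u0306";

    // UTF-8 code units: each of the two code points takes 2 bytes.
    assert(s.length == 4);
    assert(s.byCodeUnit.walkLength == 4);

    // Code points: iterating a string as a range decodes it.
    assert(s.walkLength == 2);

    // Graphemes: one user-perceived character.
    assert(s.byGrapheme.walkLength == 1);
}
```

So the same text has three different "lengths" depending on which level you iterate at, and even Russian text can contain combining marks if it arrives in decomposed form.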

Isn't this approach
overcomplicated? I don't want to write Dostoevskiy's book "War
and Peace" in order to write a parser for a simple DSL.

It's Tolstoy actually:
http://en.wikipedia.org/wiki/War_and_Peace

You don't need byGrapheme for a simple DSL. In fact, as long as the DSL is simple enough (ASCII only), you can safely avoid decoding. If it's in Russian, you might want to decode. Even in that case there are ways to avoid decoding, though it may involve about as much writing as a typical short novel ;)
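
For the ASCII-only case, a rough sketch of what I mean is below. The tokenize function and its token rules are hypothetical, made up for illustration; the point is only that indexing a string yields code units, so nothing gets decoded.

```d
import std.ascii : isAlpha, isDigit, isWhite;

// Hypothetical tokenizer for an ASCII-only DSL: identifiers, integers,
// and single-character punctuation. Input is assumed to be pure ASCII.
string[] tokenize(string src)
{
    string[] tokens;
    size_t i = 0;
    while (i < src.length)
    {
        // src[i] is a single UTF-8 code unit (char); no decoding happens.
        if (isWhite(src[i])) { ++i; continue; }

        immutable start = i;
        if (isAlpha(src[i]))
        {
            // identifier: a letter followed by letters or digits
            while (i < src.length && (isAlpha(src[i]) || isDigit(src[i]))) ++i;
        }
        else if (isDigit(src[i]))
        {
            // integer literal
            while (i < src.length && isDigit(src[i])) ++i;
        }
        else
        {
            ++i; // single-character punctuation token
        }
        tokens ~= src[start .. i]; // a slice of the input, no copy
    }
    return tokens;
}
```

With this sketch, tokenize("let x1 = 42;") gives ["let", "x1", "=", "42", ";"]. Non-ASCII input is outside what it handles correctly.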

In fact, I did a couple of such literary exercises in the standard library.

For codepoint lookups on non-decoded strings:
http://dlang.org/phobos/std_uni.html#.utfMatcher

And to create sets of codepoints to detect with matcher:
http://dlang.org/phobos/std_uni.html#.CodepointSet
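
A short sketch of how the two fit together, modelled on the usage shown in the std.uni docs. The set contents and the input string are my own illustration, and the set deliberately covers only the basic letters А..я (so not ё/Ё).

```d
import std.uni : CodepointSet, utfMatcher;

void main()
{
    // One half-open interval [a, b): basic Cyrillic letters U+0410..U+044F.
    auto cyrillic = CodepointSet('А', 'я' + 1);

    // Matcher that works directly on UTF-8 code units.
    auto m = utfMatcher!char(cyrillic);

    string input = "да=1";
    assert(m.match(input));   // 'д' is in the set and is consumed
    assert(input == "а=1");
    assert(m.match(input));   // so is 'а'
    assert(!m.match(input));  // '=' is not in the set
    assert(input == "=1");    // input is left untouched on no match
}
```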

--
Dmitry Olshansky
