On 17.01.2011 04:38, Daniel Gibson wrote:
On 17.01.2011 03:45, Andrei Alexandrescu wrote:
On 1/16/11 6:42 PM, Daniel Gibson wrote:
On 17.01.2011 00:58, Andrei Alexandrescu wrote:
On 1/16/11 3:20 PM, Michel Fortin wrote:
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu said:
But most strings don't contain combining characters or unnormalized
strings.
I think we should expect combining marks to be used more and more as our
OS text system and fonts start supporting them better. Them being rare
might be true today, but what do you know about tomorrow?
I don't think languages will acquire more diacritics soon. I do hope, of
course, that D applications gain more usage in the Arabic, Hebrew etc.
world.
So why does D use Unicode anyway?
If you don't care about rarely used languages anyway, you could have
used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide
which encoding he wants/needs).
You could as well say "we don't need to use dchar to represent a proper
code point, wchar is enough for most use cases and has less overhead
anyway".
I consider UTF8 superior to all of the above.
Really? UTF32 - maybe. But IMHO even when not considering graphemes and such
UTF8 sucks hard in comparison to those because one code point consists of 1-4
code units (even in German 1-2 code units).
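To make the code unit counts concrete, here is a small sketch in D. It only uses the built-in string types and `.length`, which counts code units; the sample word and the escape sequences are just my illustration.

```d
void main()
{
    // "Grüße", with ü (U+00FC) and ß (U+00DF) written as escapes so the
    // result does not depend on how this file is saved.
    string  s8  = "Gr\u00FC\u00DFe";   // UTF-8,  code unit = char
    wstring s16 = "Gr\u00FC\u00DFe"w;  // UTF-16, code unit = wchar
    dstring s32 = "Gr\u00FC\u00DFe"d;  // UTF-32, code unit = dchar

    assert(s8.length  == 7);  // ü and ß take two UTF-8 code units each
    assert(s16.length == 5);  // one UTF-16 code unit per code point here
    assert(s32.length == 5);  // always one code unit per code point

    // A code point outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
    assert("\U0001D11E".length  == 4);  // four UTF-8 code units
    assert("\U0001D11E"w.length == 2);  // a UTF-16 surrogate pair
    assert("\U0001D11E"d.length == 1);
}
```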
I think it's reasonable to understand why I'm happy with the current
state of affairs. It is better than anything we've had before and
better than everything else I've tried.
It is indeed easy to understand why you're happy with the current state
of affairs: you never had to deal with multi-code-point characters and
can't imagine yourself having to deal with them on a semi-frequent
basis.
Do you, and can you?
Other people won't be so happy with this state of affairs, but
they'll probably notice only after most of their code has been written
unaware of the problem.
They can't be unaware and write said code.
Fun fact: Germany recently introduced a new ID card and some of the
software that was developed for this and is used in some record sections
fucks up when a name contains diacritics.
I think especially when you're handling names (and much software does, I
think) it's crucial to have proper support for all kinds of chars.
Of course many programmers are not aware that just because umlauts and ß
work, it doesn't mean that all other kinds of strange characters work as well.
Cheers,
- Daniel
I think German text works well with dchar.
Yes, but even in Germany there are people whose names contain "strange"
characters ;)
Is it common to have programs that deal with text in a specific language but not
with names?
I do understand your resistance to supporting Unicode properly - it's a lot of
trouble and makes things inefficient (more inefficient than UTF8/16 already are
because of that code point != code unit thing).
Another thing is that due to bad support from fonts or console/GUI technology it
may happen (quite often) that one grapheme is *not* displayed as a single
character, thus messing up formatting anyway (Still you probably should cut a
string within a grapheme).
I meant you should *not* cut a string within a grapheme.
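A small sketch of what "cutting within a grapheme" looks like, even with dchar. It assumes a present-day Phobos, where `std.uni.graphemeStride` exists (it did not when this thread was written); the sample string is only an illustration.

```d
import std.uni : graphemeStride;

void main()
{
    // "ñ" written as 'n' + U+0303 COMBINING TILDE, followed by 'o'.
    dstring s = "n\u0303o"d;
    assert(s.length == 3);  // three code points, but only two graphemes

    // Slicing after the first code point separates the tilde from its base.
    assert(s[0 .. 1] == "n"d);

    // graphemeStride returns the length (in code units) of the grapheme
    // that starts at the given index, so this slice keeps "ñ" intact.
    size_t end = graphemeStride(s, 0);
    assert(end == 2);
    assert(s[0 .. end] == "n\u0303"d);
}
```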
So here's what I think can be done (and, at least the first two points,
especially the first, should be done):
1. Mention the Grapheme and Digraph situation in string related documentation
(std.string and maybe string-related stuff in std.algorithm like Splitter) to
make sure people who use Phobos are aware of the problem. Then at least they
can't say that nobody told them when their Objective-C-using colleagues are
laughing at their broken Unicode support ;)
2. Maybe add some functions that *do* deal with this.
Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can
check for themselves whether they just split their string within a grapheme or
something.
3. Include a proper Unicode-string type/module, if somebody has the time and
knowledge to develop one. spir already started something like that AFAIK and
Steven Schveighoffer is even working on a complete string type - maybe
these efforts could be combined?
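To give an idea of the kind of check meant in point 2, here is a rough sketch. `isGraphemeBoundary` is a made-up name, not an existing Phobos function, and it works on an index into the string rather than on a single dchar as proposed above. It assumes a present-day Phobos with `std.uni.graphemeStride`.

```d
import std.uni : graphemeStride;

/// Hypothetical helper: true if code unit index `i` lies on a grapheme
/// boundary of `s`, i.e. if slicing at `i` would not split a grapheme.
bool isGraphemeBoundary(string s, size_t i)
{
    if (i > s.length)
        return false;
    size_t pos = 0;
    // Walk grapheme by grapheme until we reach or step over `i`.
    while (pos < i)
        pos += graphemeStride(s, pos);
    return pos == i;
}

void main()
{
    // 'a', then 'n' + U+0303 COMBINING TILDE (two UTF-8 code units), then 'o'.
    string s = "an\u0303o";
    assert(s.length == 5);

    assert( isGraphemeBoundary(s, 0));
    assert( isGraphemeBoundary(s, 1));  // between 'a' and 'ñ'
    assert(!isGraphemeBoundary(s, 2));  // between 'n' and its tilde
    assert(!isGraphemeBoundary(s, 3));  // inside the tilde's UTF-8 sequence
    assert( isGraphemeBoundary(s, 4));  // between 'ñ' and 'o'
    assert( isGraphemeBoundary(s, 5));  // end of string
}
```

The walk is linear in the length of the string, which is the price of storing text as code units; a dedicated string type could cache boundaries instead.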
I guess default strings will stay mostly the way they are (but please add an
ASCII type or allow ubyte[] asciiStr = "asdf";).
Having an additional type in Phobos that works correctly in all cases (e.g.
Arabic, Hebrew, Japanese, ..) would be really great, though.
UniString uStr = new UniString("sdfüñẫ");
UniString uStr2 = uStr[3..$]; // "üñẫ"
UniGraph ug = uStr[5]; // 'ẫ'
size_t i = uStr2.length; // 3
of course I forgot:
string s = uStr2.toString();
dstring s2 = uStr2.toDString();
to convert it back to a "normal" string
something like that maybe (of course plus a lot of other stuff like proper
comparison for different encodings of the same char like a modified icmp()
discussed before).
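As an illustration of the comparison problem: the same character can be encoded precomposed or as base letter plus combining mark, and a plain comparison sees two different strings. The sketch below assumes a present-day Phobos, whose `std.uni` offers `normalize` and `icmp`; normalizing both sides first is just one possible approach, not necessarily what a modified icmp() would do.

```d
import std.uni : icmp, normalize, NFC;

void main()
{
    string precomposed = "\u00FC";   // 'ü' as a single code point
    string decomposed  = "u\u0308";  // 'u' + U+0308 COMBINING DIAERESIS

    // Code unit comparison treats the two encodings as different strings.
    assert(precomposed != decomposed);

    // After normalizing both to NFC they compare equal.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));

    // Case-insensitive variant: 'Ü' (U+00DC) against the decomposed 'ü'.
    assert(icmp("\u00DC", normalize!NFC(decomposed)) == 0);
}
```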
But something like
string str = "sdfüñẫ";
size_t len = uniLen(str); // 6
string s = uniSlice(str, 3, len); // == str.uniSlice(3, len); "üñẫ"
etc may be just as good.
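A rough sketch of how such free functions could look. `uniLen` and `uniSlice` are the hypothetical names from above and `graphemeOffset` is a helper I made up for this sketch; none of them exist in Phobos. It assumes a present-day Phobos with `std.uni.byGrapheme` and `std.uni.graphemeStride`.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Hypothetical: number of graphemes in `s`.
size_t uniLen(string s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: code unit offset at which the n-th grapheme starts
/// (clamped to s.length if there are fewer than n graphemes).
size_t graphemeOffset(string s, size_t n)
{
    size_t pos = 0;
    for (size_t g = 0; g < n && pos < s.length; ++g)
        pos += graphemeStride(s, pos);
    return pos;
}

/// Hypothetical: slice of `s` covering graphemes [from, to), no copying.
string uniSlice(string s, size_t from, size_t to)
{
    assert(from <= to);
    return s[graphemeOffset(s, from) .. graphemeOffset(s, to)];
}

void main()
{
    // "sdfüñẫ" with every accent written as a combining mark.
    string str = "sdfu\u0308n\u0303a\u0302\u0303";
    assert(uniLen(str) == 6);
    assert(str.uniSlice(3, uniLen(str)) == "u\u0308n\u0303a\u0302\u0303");
    assert(uniSlice(str, 5, 6) == "a\u0302\u0303");

    // The precomposed spelling gives the same grapheme count.
    assert(uniLen("sdf\u00FC\u00F1\u1EAB") == 6);
}
```

Both functions have to walk the string from the start, so indexing is O(n); that is one argument for a separate type that can remember grapheme boundaries.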
(I hope this all made sense)
Andrei
Cheers,
- Daniel