On Tue, 18 Mar 2014 23:18:16 +0400, Dmitry Olshansky <[email protected]> wrote:

> 18-Mar-2014 10:21, Marco Leise wrote:
> > The Unicode standard is too complex for general purpose
> > algorithms to do useful things on D strings. We don't see that
> > however, since our writing systems are sufficiently well
> > supported.
> >
> > As an inspiration I'll leave a string here that contains
> > combined characters in Korean
> > (http://decodeunicode.org/hangul_syllables)
> > and Latin as well as full width characters that span 2
> > characters in e.g. Latin, Greek or Cyrillic scripts
> > (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):
> >
> > Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊
> >
> > (I used the "unfonts" package for the Hangul part)
> >
> > What I want to say is that for correct Unicode handling we
> > should either use existing libraries or get a feeling for
> > what the Unicode standard provides, then form use cases out of it.
>
> There is ICU and very few other things, like support in OSX frameworks
> (NSString). Industry in general kinda sucks on this point but
> desperately wants to improve.
>
> > For example when we talk about the length of a string we are
> > actually talking about 4 different things:
> >
> >   - number of code units
> >   - number of code points
> >   - number of user perceived characters
> >   - display width using a monospace font
> >
> > The same distinction applies for slicing, depending on use case.
> >
> > Related:
> >   - What normalization do D strings use. Both Linux and
> >     MacOS X use UTF-8, but the binary representation of non-ASCII
> >     file names is different.
>
> There is no single normalization to fix on.
> D programs may be written for Linux only, for Mac-only or for both.

Normalization forms C and D (NFC and NFD) are the lossless ones and,
as far as I understand, canonically equivalent: they are different
encodings of the same text. So I agree.

> IMO we should just provide ways to normalize strings.
> (std.uni.normalize has 'normalize' for starters).

I wonder whether anyone will actually read up on normalization
prior to touching Unicode strings.
I didn't, Andrei didn't and so on... So I expect strA == strB to
be common enough, just like floatA == floatB was until the news
spread. Since == is supposed to compare for equivalence, could we
hide all those details in an opaque string type and offer correct
comparison functions?

> > - How do we handle sorting strings?
>
> Unicode collation algorithm and provide ways to tweak the default one.

I wish I hadn't looked at the UCA. Jeeeez... But yeah, that's the
way to go. Big frameworks like Java added a Collator class with
predefined constants for several languages. That's too much work
for us. But the API doesn't need to preclude adding those.

> > The topic matter is complex, but not difficult (as in rocket science).
> > If we really want to find a solution, we should form an expert group
> > and stop talking until we read the latest Unicode specs.
>
> Well, I did. You seem motivated, would you like to join the group?

Yes, I'd like to see a "Unicode 6.x approved" stamp on D.
I didn't know that you already wrote all the simple algorithms
for 2.064. Those would have been my candidates to work on, too.
Is there anything that can be implemented in a day or two? :)

> > They are a
> > moving target. Don't expect to ever be "done" with full Unicode
> > support in D.
>
> The 6.x standard line seems pretty stable to me. There is a point in
> providing support that is worth approaching. After that ROI is drooping
> steadily as the amount of work to specialize for each specific culture
> rises. At some point we can only talk about opening up ways to specialize.
>
> D (or any library for that matter) won't ever have all possible
> tinkering that Unicode standard permits. So I expect D to be "done" with
> Unicode one day simply by reaching a point of having all universally
> applicable stuff (and stated defaults) plus having a toolbox to craft
> your own versions of algorithms. This is the goal of new std.uni.

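Coming back to the four notions of length and to strA == strB
from earlier in this mail: here is a minimal sketch of what I mean.
It is in Python rather than D only because its standard library
exposes the Unicode database directly. The helper names are made up
for illustration, and UTF-8 is assumed for code units (as in D's
string). The "perceived characters" and "display width" functions
are rough approximations: they skip combining marks and count East
Asian wide/fullwidth characters as two columns. Proper grapheme
segmentation (UAX #29) would, for instance, treat the conjoining
Hangul jamo quoted above differently.

```python
import unicodedata


def code_units(s):
    # Number of UTF-8 code units (bytes), i.e. what .length gives for a D string.
    return len(s.encode("utf-8"))


def code_points(s):
    # A Python str is a sequence of code points.
    return len(s)


def perceived_chars(s):
    # Approximation: ignore code points with a non-zero canonical
    # combining class. This is NOT full UAX #29 grapheme segmentation.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


def display_width(s):
    # Approximation: combining marks take no column, East Asian
    # wide (W) and fullwidth (F) characters take two, the rest one.
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def lengths(s):
    return code_units(s), code_points(s), perceived_chars(s), display_width(s)


def equivalent(a, b):
    # Compare under canonical equivalence by normalizing both sides to NFC.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


decomposed = "a\u0308"  # 'a' + COMBINING DIAERESIS
composed = "\u00e4"     # precomposed LATIN SMALL LETTER A WITH DIAERESIS
fullwidth = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A

print(lengths(decomposed))  # (3, 2, 1, 1)
print(lengths(composed))    # (2, 1, 1, 1)
print(lengths(fullwidth))   # (3, 1, 1, 2)
print(decomposed == composed)            # False
print(equivalent(decomposed, composed))  # True
```

The two spellings of ä render identically but compare unequal
with ==, which is exactly the trap. Comparing after normalization
removes it.
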
Sorting strings is a very basic feature, but as I have now learned
also a highly complex one. I expected that some kind of downloadable
tables would suffice, but the rules are pretty detailed. E.g. in
German phonebook order, ä/ö/ü sort the same as ae/oe/ue.

-- 
Marco
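
P.S. To illustrate the phonebook rule, here is a rough sketch, in
Python for brevity, of a sort key that expands ä/ö/ü to ae/oe/ue
before comparing. The name phonebook_key is hypothetical. This
covers only the one rule mentioned above; it is neither the UCA nor
a complete implementation of German phonebook collation.

```python
import unicodedata

# Expansion table for the single rule discussed: umlauts sort like
# their two-letter transcriptions. Keys are precomposed (NFC) letters.
_UMLAUTS = str.maketrans({
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
})


def phonebook_key(name):
    # Normalize first so that decomposed input (u + combining
    # diaeresis) hits the table, then fold case and expand.
    folded = unicodedata.normalize("NFC", name).casefold()
    return folded.translate(_UMLAUTS)


names = ["Muller", "M\u00fcller", "Mueller", "Mahler"]

# Plain code point order puts "Müller" last, after every ASCII-only name.
print(sorted(names))
# With the key, "Müller" and "Mueller" compare equal and sort before "Muller".
print(sorted(names, key=phonebook_key))
```

Since sorted() is stable, names whose keys are equal keep their
input order. A real collation would break such ties on a secondary
level instead.
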
