18-Mar-2014 10:21, Marco Leise wrote:
The Unicode standard is too complex for general-purpose
algorithms to do useful things on D strings. We don't notice
that, however, since our own writing systems are sufficiently
well supported.

As an inspiration I'll leave a string here that contains
combining characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin, as well as fullwidth characters that take up two
character cells, e.g. for Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the "unfonts" package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feel for
what the Unicode standard provides, then derive use cases from it.

There is ICU and very few other things, like the support in the OS X frameworks (NSString). The industry in general kinda sucks on this point but desperately wants to improve.


For example, when we talk about the length of a string we are
actually talking about four different things:

   - number of code units
   - number of code points
   - number of user perceived characters
   - display width using a monospace font

The same distinction applies for slicing, depending on use case.
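
To make the distinction concrete, here is a minimal D sketch of the first three counts on one short string (the monospace display width has no Phobos counterpart, so it is only noted in a comment):

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "a\u0301bc";            // 'a' + combining acute, then "bc"

    writeln(s.length);                 // code units (UTF-8 bytes): 5
    writeln(s.count);                  // code points: 4
    writeln(s.byGrapheme.walkLength);  // user-perceived characters: 3
    // Display width in a monospace font (fullwidth forms occupy two cells)
    // needs East Asian Width data; Phobos has no equivalent of wcswidth().
}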

Related:
   - What normalization do D strings use? Both Linux and
     Mac OS X use UTF-8, but the binary representation of non-ASCII
     file names is different.

There is no single normalization to settle on.
D programs may be written for Linux only, for Mac only, or for both.

IMO we should just provide ways to normalize strings
(std.uni already has 'normalize' for starters).
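
A minimal sketch of that, assuming the current std.uni (the NFC/NFD difference is exactly what shows up in those Linux vs. OS X file names):

import std.uni;

void main()
{
    string precomposed = "\u00E9";     // "é" as a single code point
    string decomposed  = "e\u0301";    // 'e' followed by combining acute

    assert(precomposed != decomposed);                 // different bytes
    assert(normalize!NFC(decomposed) == precomposed);  // compose
    assert(normalize!NFD(precomposed) == decomposed);  // decompose (OS X-like)
}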

   - How do we handle sorting strings?

Use the Unicode Collation Algorithm, and provide ways to tweak the default one.
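
Until a real UCA implementation lands, a stopgap one can write today (my sketch, not proper collation) is to normalize and compare case-insensitively:

import std.algorithm : sort;
import std.uni;

void main()
{
    // Not the Unicode Collation Algorithm: just NFC-normalize and compare
    // case-insensitively per code point. Real collation needs the DUCET
    // table plus per-locale tailoring.
    auto names = ["r\u00E9sum\u00E9", "Resume", "re\u0301sume\u0301"];
    names.sort!((a, b) => icmp(normalize!NFC(a), normalize!NFC(b)) < 0);
}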

The subject matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we have read the latest Unicode specs.

Well, I did. You seem motivated; would you like to join the group?

They are a
moving target. Don't expect to ever be "done" with full Unicode
support in D.

The 6.x standard line seems pretty stable to me. There is a point of support that is worth reaching; after that, the ROI drops steadily as the amount of work needed to specialize for each specific culture rises. At some point we can only talk about opening up ways to specialize.

D (or any library, for that matter) won't ever have every bit of tailoring the Unicode standard permits. So I expect D to be "done" with Unicode one day simply by reaching the point of having all the universally applicable pieces (with stated defaults) plus a toolbox for crafting your own versions of the algorithms. This is the goal of the new std.uni.
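
That toolbox already shows in the current std.uni: property sets you can combine with set algebra and query directly. A small sketch:

import std.uni;

void main()
{
    // Combine ready-made script sets into a custom code point set,
    // then test membership directly.
    auto set = unicode.Cyrillic | unicode.Greek;
    assert(set['\u044F']);   // 'я'
    assert(set['\u03C6']);   // 'φ'
    assert(!set['a']);
}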


--
Dmitry Olshansky
