18-Mar-2014 10:21, Marco Leise wrote:
The Unicode standard is too complex for general-purpose
algorithms to do useful things on D strings. We don't notice
that, however, since our own writing systems are sufficiently
well supported.

As an inspiration I'll leave a string here that contains
combining characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin, as well as fullwidth characters that take up two
character cells, e.g. for Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Fullwidth, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the "unfonts" package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feel for
what the Unicode standard provides, then derive use cases from it.

There is ICU and very few other things, like the support in the OS X frameworks (NSString). The industry in general kinda sucks on this point but desperately wants to improve.


For example, when we talk about the length of a string we are
actually talking about four different things:

   - number of code units
   - number of code points
   - number of user perceived characters
   - display width using a monospace font

The same distinction applies for slicing, depending on use case.
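
To make the distinction concrete, here is a minimal D sketch of the first three counts on one short string (the monospace display width has no Phobos counterpart, so it is only noted in a comment):

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "a\u0301bc";            // 'a' + combining acute, then "bc"

    writeln(s.length);                 // code units (UTF-8 bytes): 5
    writeln(s.count);                  // code points: 4
    writeln(s.byGrapheme.walkLength);  // user-perceived characters: 3
    // Display width in a monospace font (fullwidth forms occupy two cells)
    // needs East Asian Width data; Phobos has no equivalent of wcswidth().
}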

Related:
   - What normalization do D strings use? Both Linux and
     Mac OS X use UTF-8, but the binary representation of non-ASCII
     file names is different.

There is no single normalization to settle on.
D programs may be written for Linux only, for Mac only, or for both.

IMO we should just provide ways to normalize strings
(std.uni already has 'normalize' for starters).
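
A minimal sketch of that, assuming the current std.uni (the NFC/NFD difference is exactly what shows up in those Linux vs. OS X file names):

import std.uni;

void main()
{
    string precomposed = "\u00E9";     // "é" as a single code point
    string decomposed  = "e\u0301";    // 'e' followed by combining acute

    assert(precomposed != decomposed);                 // different bytes
    assert(normalize!NFC(decomposed) == precomposed);  // compose
    assert(normalize!NFD(precomposed) == decomposed);  // decompose (OS X-like)
}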

   - How do we handle sorting strings?

Use the Unicode Collation Algorithm, and provide ways to tweak the default one.
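
Until a real UCA implementation lands, a stopgap one can write today (my sketch, not proper collation) is to normalize and compare case-insensitively:

import std.algorithm : sort;
import std.uni;

void main()
{
    // Not the Unicode Collation Algorithm: just NFC-normalize and compare
    // case-insensitively per code point. Real collation needs the DUCET
    // table plus per-locale tailoring.
    auto names = ["r\u00E9sum\u00E9", "Resume", "re\u0301sume\u0301"];
    names.sort!((a, b) => icmp(normalize!NFC(a), normalize!NFC(b)) < 0);
}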

The subject matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we have read the latest Unicode specs.

Well, I did. You seem motivated; would you like to join the group?

They are a
moving target. Don't expect to ever be "done" with full Unicode
support in D.

The 6.x standard line seems pretty stable to me. There is a point of support that is worth reaching; after that, the ROI drops steadily as the amount of work needed to specialize for each specific culture rises. At some point we can only talk about opening up ways to specialize.

D (or any library, for that matter) won't ever have every bit of tailoring the Unicode standard permits. So I expect D to be "done" with Unicode one day simply by reaching the point of having all the universally applicable pieces (with stated defaults) plus a toolbox for crafting your own versions of the algorithms. This is the goal of the new std.uni.
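
That toolbox already shows in the current std.uni: property sets you can combine with set algebra and query directly. A small sketch:

import std.uni;

void main()
{
    // Combine ready-made script sets into a custom code point set,
    // then test membership directly.
    auto set = unicode.Cyrillic | unicode.Greek;
    assert(set['\u044F']);   // 'я'
    assert(set['\u03C6']);   // 'φ'
    assert(!set['a']);
}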


--
Dmitry Olshansky
