19-Mar-2014 18:42, Marco Leise wrote:
Am Tue, 18 Mar 2014 23:18:16 +0400
schrieb Dmitry Olshansky <[email protected]>:
Related:
- What normalization do D strings use? Both Linux and
Mac OS X use UTF-8, but the binary representation of non-ASCII
file names is different.
There is no single normalization to fix on.
D programs may be written for Linux only, for Mac-only or for both.
Normalizations C and D are the non-lossy ones and, as far as I
understood, equivalent. So I agree.
Right, the KC & KD ones are really all about fuzzy matching and searching.
IMO we should just provide ways to normalize strings.
(std.uni has 'normalize' for starters).
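To make the pitfall concrete, here is the same idea in Python's unicodedata module, used purely as a stand-in for illustration (the thread itself is about D's std.uni):

```python
import unicodedata

# "é" as a single precomposed code point (NFC form) vs.
# base letter + combining accent (NFD form)
nfc = "\u00e9"      # é, LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"     # e + COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are in NFC
```

Both spellings render identically, yet naive comparison sees different bytes until both sides are normalized to the same form.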
I wondered if anyone would actually read up on normalization
prior to touching Unicode strings. I didn't, Andrei didn't, and
so on...
So I expect strA == strB to be common enough, just like floatA
== floatB, until the news spreads.
If that's of any comfort, other languages are even worse here. In C++ you
are hopeless without ICU.
Since == is supposed to
compare for equivalence, could we hide all those details in
an opaque string type and offer correct comparison functions?
Well, it turns out the Unicode standard ties equivalence to normalization
forms. In other words, unless both your strings are normalized the same
way, there is really no point in trying to compare them.
As for the opaque type: we could have, say, String!NFC and String!NFD or
some such. It would then make sure the normalization is the right one.
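A minimal sketch of what such an opaque type could look like, in Python for illustration (the real String!NFC would be a D template; the class name and shape here are made up):

```python
import unicodedata

class NormalizedString:
    """Stores text in a fixed normalization form so == is well-defined."""
    def __init__(self, text, form="NFC"):
        self.form = form
        self.text = unicodedata.normalize(form, text)

    def __eq__(self, other):
        # Only compare against strings held in the same form.
        return (isinstance(other, NormalizedString)
                and self.form == other.form
                and self.text == other.text)

a = NormalizedString("e\u0301")   # decomposed input
b = NormalizedString("\u00e9")    # precomposed input
print(a == b)                     # True: both stored as NFC
```

Normalizing once at construction means every later comparison is a plain code-unit comparison, which is exactly the property the opaque type is meant to guarantee.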
- How do we handle sorting strings?
Use the Unicode collation algorithm, and provide ways to tweak the default one.
I wish I didn't look at the UCA. Jeeeez...
But yeah, that's the way to go.
Needless to say, I had a nice jaw-dropping moment when I realized which
elephant I had missed with our std.uni (somewhere in the middle of the
work).
Big frameworks like Java added a Collate class with predefined
constants for several languages. That's too much work for us.
But the API doesn't need to preclude adding those.
Indeed, some kind of Collator is in order. On the use side of things it's
simply a functor that compares strings. The fact that it's full of
tables and the like is well hidden. The only thing above that is caching
preprocessed strings, which may be useful for databases and string indexes.
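A toy sketch of that caching idea: a collator functor that precomputes and memoizes sort keys. The weight table here is made up; a real UCA implementation would derive weights from the DUCET tables.

```python
from functools import lru_cache

# Toy collation weights (invented for illustration; a real collator
# would look these up in the DUCET tables).
WEIGHTS = {c: i for i, c in enumerate("aAbBcC")}

class Collator:
    @lru_cache(maxsize=None)          # cache preprocessed sort keys
    def sort_key(self, s):
        return tuple(WEIGHTS.get(ch, len(WEIGHTS)) for ch in s)

    def __call__(self, strings):
        return sorted(strings, key=self.sort_key)

coll = Collator()
print(coll(["b", "A", "a", "B"]))    # ['a', 'A', 'b', 'B']
```

The caller only sees a functor that orders strings; the tables and the cached keys stay behind the interface, and the cache is precisely the "preprocessed strings" a database index would hold on to.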
The topic matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we read the latest Unicode specs.
Well, I did. You seem motivated, would you like to join the group?
Yes, I'd like to see a Unicode 6.x approved stamp on D.
I didn't know that you already wrote all the simple algorithms
for 2.064. Those would have been my candidates to work on, too.
Is there anything that can be implemented in a day or two? :)
Cool, consider yourself enlisted :)
I reckon word and line breaking algorithms are a piece of cake compared to
the UCA. Given the power toys of CodepointSet and toTrie, it shouldn't be
that hard to come up with a prototype. Then we just move precomputed
versions of the related tries to std/internal/ and that's it, ready for
public consumption.
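For flavor, a grossly simplified word-break sketch in Python. Real UAX #29 segmentation consults per-code-point break properties (the precomputed tries the prototype would use); a plain separator set stands in for that lookup here:

```python
# Hypothetical, drastically simplified stand-in for a break-property lookup.
SEPARATORS = set(" \t\n.,;:!?")

def words(text):
    """Split text at separator characters, dropping the separators."""
    out, start = [], None
    for i, ch in enumerate(text):
        if ch in SEPARATORS:
            if start is not None:
                out.append(text[start:i])
                start = None
        elif start is None:
            start = i
    if start is not None:
        out.append(text[start:])
    return out

print(words("Hello, world! Unicode."))   # ['Hello', 'world', 'Unicode']
```

Swapping the set membership test for a trie over Unicode break properties is what turns this sketch into the real algorithm.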
D (or any library for that matter) won't ever have all the possible
tinkering that the Unicode standard permits. So I expect D to be "done" with
Unicode one day simply by reaching a point of having all universally
applicable stuff (and stated defaults) plus having a toolbox to craft
your own versions of algorithms. This is the goal of new std.uni.
Sorting strings is a very basic feature, but as I learned now
also highly complex. I expected some kind of tables for
download that would suffice, but the rules are pretty detailed.
E.g. in German phonebook order, ä/ö/ü sort the same as
ae/oe/ue.
This is tailoring, an awful thing that makes cultural differences what
they are in Unicode ;)
What we need first and foremost is a DUCET-based version (Default Unicode
Collation Element Table).
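The phonebook example above can be approximated as a tailoring layered on top of a plain comparison. In this sketch the expansion table is the tailoring, and ordinary code-point order stands in for the DUCET default:

```python
import unicodedata

# German phonebook tailoring (DIN 5007-2): expand umlauts before comparing.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def phonebook_key(word):
    # Normalize first so the umlaut is a single code point, then tailor.
    return unicodedata.normalize("NFC", word).lower().translate(PHONEBOOK)

names = ["Müller", "Mueller", "Mueser", "Muffler"]
print(sorted(names, key=phonebook_key))
# ['Müller', 'Mueller', 'Mueser', 'Muffler']
```

"Müller" and "Mueller" get identical keys, so they sort together, which is exactly the cultural difference the tailoring encodes on top of the default table.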
--
Dmitry Olshansky