On Thu, 20 Mar 2014 01:55:08 +0400, Dmitry Olshansky <[email protected]> wrote:
> Well, turns out the Unicode standard ties equivalence to normalization
> forms. In other words unless both your strings are normalized the same
> way there is really no point in trying to compare them.
>
> As for opaque type - we could have say String!NFC and String!NFD or
> some-such. It would then make sure the normalization is the right one.

And I thought of going the slow route where normalized and
unnormalized strings can coexist and be compared. No NFD or
NFC, just UTF8 strings.

Pros:
+ Learning about normalization isn't needed to use strings
  correctly. And few people do that.
+ Strings don't need to be normalized. Every modification to
  data is bad, e.g. when said string is fed back to the
  source. Think about a file name on a file system where a
  different normalization is a different file.

Cons:
- Comparisons for already normalized strings are unnecessarily
  slow. Maybe the normalization form (NFC, NFD, mixed) could be
  stored alongside the string.

> Cool, consider yourself enlisted :)
> I reckon word and line breaking algorithms are piece of cake compared to
> UCA. Given the power toys of CodepointSet and toTrie it shouldn't be
> that hard to come up with prototype. Then we just move precomputed
> versions of related tries to std/internal/ and that's it, ready for
> public consumption.

Would a typical use case be to find the previous/next boundary
given a code unit index? E.g. the cursor sits on a word and
you want to jump to the start or end of it. Just iterating the
words and lines might not be too useful.

> >> D (or any library for that matter) won't ever have all possible
> >> tinkering that Unicode standard permits. So I expect D to be "done" with
> >> Unicode one day simply by reaching a point of having all universally
> >> applicable stuff (and stated defaults) plus having a toolbox to craft
> >> your own versions of algorithms. This is the goal of new std.uni.
> >
> > Sorting strings is a very basic feature, but as I learned now
> > also highly complex. I expected some kind of tables for
> > download that would suffice, but the rules are pretty detailed.
> > E.g. in German phonebook order, ä/ö/ü has the same order as
> > ae/oe/ue.
>
> This is tailoring, an awful thing that makes cultural differences what
> they are in Unicode ;)
>
> What we need first and furthermost DUCET based version (default Unicode
> collation element tables).

Of course.

-- 
Marco
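
A footnote on the "slow route", as a rough sketch only: the comparison could normalize on the fly and leave the stored strings untouched. `normalize` and `NFC` are the existing std.uni names; `canonicallyEqual` is a hypothetical helper made up for this mail, not a proposed API.

```d
import std.uni : normalize, NFC;

/// Hypothetical helper, not part of Phobos: true if `a` and `b` are
/// canonically equivalent, whatever normalization each one is in.
bool canonicallyEqual(string a, string b)
{
    // Fast path: byte-identical strings are trivially equivalent.
    if (a == b)
        return true;
    // Slow path: compare the NFC forms. normalize never modifies
    // its input, so the original data stays exactly as it was.
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string composed   = "B\u00E4cker";  // 'ä' as one precomposed code point
    string decomposed = "Ba\u0308cker"; // 'a' followed by combining diaeresis

    assert(composed != decomposed);                 // different code units
    assert(canonicallyEqual(composed, decomposed)); // same text
    assert(!canonicallyEqual(composed, "Backer"));
}
```

The cost mentioned under "Cons" is visible here: two strings that are already in the same form but differ still go through `normalize` before the comparison. A stored form tag would let the slow path be skipped in that case.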

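
A footnote on the boundary question, to show the kind of interface I have in mind. This is only a sketch under a big assumption: a "word" here is simply a run of alphabetic code points according to `std.uni.isAlpha`, which is NOT the real Unicode word breaking algorithm. `wordStart` and `wordEnd` are hypothetical names, and `index` is assumed to sit on a code point boundary.

```d
import std.uni : isAlpha;
import std.utf : decode, strideBack;

/// Hypothetical: code unit index where the word around `index` starts.
size_t wordStart(string s, size_t index)
{
    while (index > 0)
    {
        // Length in code units of the code point that ends at `index`.
        immutable back = strideBack(s, index);
        size_t i = index - back;
        immutable dchar c = decode(s, i); // decode advances `i`, not `index`
        if (!isAlpha(c))
            break;
        index -= back;
    }
    return index;
}

/// Hypothetical: code unit index just past the word around `index`.
size_t wordEnd(string s, size_t index)
{
    while (index < s.length)
    {
        size_t next = index;
        immutable dchar c = decode(s, next); // `next` moves past the code point
        if (!isAlpha(c))
            break;
        index = next;
    }
    return index;
}

unittest
{
    string s = "zum B\u00E4cker gehen";
    // Index 8 is the 'k'; the 'ä' before it occupies two code units.
    assert(wordStart(s, 8) == 4);
    assert(wordEnd(s, 8) == 11);
    assert(s[wordStart(s, 8) .. wordEnd(s, 8)] == "B\u00E4cker");
}
```

A trie-based version would replace the `isAlpha` test with the proper break property lookup, but the signature (string plus code unit index in, code unit index out) could stay the same.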