On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii <e...@gnu.org> wrote:
> I think decomposition to NFKD solves these issues, doesn't it?

Not completely. Judging from your question, you expected more mappings than NFKD has. You might want to try the mappings that are used as input for deriving the DUCET (default Unicode collation):
http://www.unicode.org/Public/UCA/latest/decomps.txt

For a character-based search, you should still try to work with canonical equivalence, for example by applying the FCD check and normalizing when that fails.
http://www.unicode.org/notes/tn5/

> Thanks. I've studied that already, and I do know that collation data
> can be used for search. But it's still a lot of data that I'd like to
> avoid loading, if possible.

Sure, as I said, it depends on what you need and want. FYI, the ICU data file corresponding to the DUCET is about 160kB (for UCA 7.0) and could be reduced if limited to one specific use case, but the collation and string-search code is large and complex.

Best regards,
markus
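To illustrate the two points above, here is a minimal Python sketch using only the standard library's unicodedata module. It shows why NFKD yields more mappings than NFD (compatibility decompositions), and a naive normalize-then-compare search under canonical equivalence. It does not implement the FCD optimization mentioned above, which would need ICU; the function name nfd_find is made up for this example.

```python
import unicodedata

# NFKD applies compatibility mappings that canonical NFD does not.
s = "\ufb01le"  # starts with U+FB01 LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFD", s) == s        # NFD leaves the ligature alone
assert unicodedata.normalize("NFKD", s) == "file"  # NFKD folds it to "fi"

def nfd_find(haystack: str, needle: str) -> int:
    """Naive character-based search under canonical equivalence:
    normalize both strings to NFD before comparing (hypothetical helper)."""
    return unicodedata.normalize("NFD", haystack).find(
        unicodedata.normalize("NFD", needle))

# Precomposed U+00E9 matches "e" + U+0301 COMBINING ACUTE ACCENT:
assert nfd_find("caf\u00e9", "e\u0301") != -1
```

Normalizing the whole haystack up front is exactly the cost the FCD check avoids: most text is already FCD, so ICU's string search can skip normalization except where the check fails.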
_______________________________________________
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode