On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii <e...@gnu.org> wrote:

> I think decomposition to NFKD solves these issues, doesn't it?

Not completely. Judging from your question, you expect more mappings than
NFKD provides. You might want to try the mappings that are used as input
for deriving the DUCET (the Default Unicode Collation Element Table):
http://www.unicode.org/Public/UCA/latest/decomps.txt
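
For comparison, here is roughly what NFKD alone gives you, sketched with
ICU4J's Normalizer2 (just a sketch; it assumes the ICU4J library is on
the classpath):

    import com.ibm.icu.text.Normalizer2;

    public class NfkdDemo {
        public static void main(String[] args) {
            // NFKD folds compatibility variants onto their base forms,
            // e.g. the "fi" ligature (U+FB01) and CIRCLED DIGIT ONE (U+2460).
            Normalizer2 nfkd = Normalizer2.getNFKDInstance();
            System.out.println(nfkd.normalize("\uFB01le \u2460"));  // prints "file 1"
        }
    }

The decomps.txt data adds mappings beyond these standard compatibility
decompositions, which is why it is the better starting point for
search-style folding.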

For a character-based search, you should still try to work with canonical
equivalence, for example by applying the FCD check and normalizing only
when that check fails. See http://www.unicode.org/notes/tn5/
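
A minimal sketch of that fast path with ICU4J's Normalizer2 (the FCD
check is exposed via Normalizer2.Mode.FCD; using NFD as the fallback
normalization is my assumption for the example):

    import com.ibm.icu.text.Normalizer2;

    public final class FcdPrepare {
        private static final Normalizer2 NFD = Normalizer2.getNFDInstance();
        // An FCD instance: its isNormalized() performs the FCD check.
        private static final Normalizer2 FCD =
                Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.FCD);

        static String prepare(String s) {
            // Fast path: most strings pass the FCD check and can be
            // matched under canonical equivalence without copying.
            if (FCD.isNormalized(s)) {
                return s;
            }
            // Slow path: normalize only when the check fails.
            return NFD.normalize(s);
        }
    }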

> Thanks.  I've studied that already, and I do know that collation data
> can be used for search.  But it's still a lot of data that I'd like to
> avoid loading, if possible.

Sure, as I said, it depends on what you need and want.

FYI, the ICU data file corresponding to the DUCET is about 160kB (for UCA
7.0) and could be reduced if limited to one specific use case, but the
collation and string-search code is large and complex.
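
For completeness, the search side looks roughly like this in ICU4J (a
sketch using StringSearch with a primary-strength collator; the locale
and strength settings are assumptions for the example):

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.RuleBasedCollator;
    import com.ibm.icu.text.SearchIterator;
    import com.ibm.icu.text.StringSearch;
    import java.text.StringCharacterIterator;
    import java.util.Locale;

    public class SearchDemo {
        public static void main(String[] args) {
            // At primary strength, accent and case differences are
            // ignored, so "resume" matches "résumé".
            RuleBasedCollator coll =
                    (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
            coll.setStrength(Collator.PRIMARY);

            String text = "Send your r\u00E9sum\u00E9 by Friday.";
            StringSearch search = new StringSearch(
                    "resume", new StringCharacterIterator(text), coll);

            for (int pos = search.first();
                    pos != SearchIterator.DONE;
                    pos = search.next()) {
                System.out.println("match at " + pos
                        + ", length " + search.getMatchLength());
            }
        }
    }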

Best regards,
markus