It still wouldn't be a bad idea to normalize to NFC as a last step 
before indexing, after you apply Erik's recommended algorithm, just in 
case.  NFC is the form intended to make two strings identical, code 
point for code point, whenever they represent the same (canonically 
equivalent) characters. I think it'll normalize some things that are 
left over even after stripping marks per Erik's suggested algorithm.
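
To make that concrete, here is a rough sketch of what I mean. It is a 
Python 3 port of the function Erik posted (quoted below), with an NFC 
pass added at the end. The name index_key is just something I made up 
for illustration, and Hangul is only one example I'm sure of for the 
"left over" case; there may well be others.

```python
import unicodedata

def strip_accents(s):
    # Python 3 port of the quoted function: decompose, then drop
    # every character in category Mn (Mark, Nonspacing).
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

def index_key(s):
    # Hypothetical helper: strip marks, then recompose to NFC so the
    # stored key is in one predictable form.
    return unicodedata.normalize('NFC', strip_accents(s))

# The same visible character, spelled two ways.
composed = '\u00e9'      # e with acute, one code point
decomposed = 'e\u0301'   # plain e followed by a combining acute accent

# The raw strings differ, but their NFC forms are identical.
assert composed != decomposed
assert unicodedata.normalize('NFC', composed) == unicodedata.normalize('NFC', decomposed)

# Both spellings strip down to the same unaccented key.
assert index_key(composed) == index_key(decomposed) == 'e'

# Something that is left over after stripping: NFD splits a Hangul
# syllable into three jamo, which are letters (not Mn), so they stay
# decomposed until the NFC pass puts the syllable back together.
hangul = '\ud55c'
assert len(strip_accents(hangul)) == 3
assert index_key(hangul) == hangul
```

Without the final NFC step, the stripped string for that Hangul 
syllable would no longer match the way most people would type or store 
it.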

Unicode is tricky.  This standards document is helpful for getting a 
handle on normalization, and it is pretty clearly written: 
http://unicode.org/reports/tr15/

On 12/13/2010 5:07 PM, Erik Hetzner wrote:
> At Mon, 13 Dec 2010 13:57:59 -0800,
> Karen Coyle wrote:
>> I'm not sure this is the same problem, but here's a bug report:
>>      https://bugs.launchpad.net/openlibrary/+bug/389217
>> that covers at least some of the issues. I remember that this has come
>> up before, and may have something to do with Solr. Definitely, one
>> needs to be able to search on accented and unaccented characters with
>> the same query. Also, those of us with dumb ASCII keyboards find it
>> very hard to key accented characters, although we may want to search
>> on names with diacritics.
> FYI, an easy way in Python to strip (many) diacritics to allow us
> Americans to search the way we like, and the way our keyboards
> support:
>
>    import unicodedata
>    def strip_accents(s):
>        return ''.join((c for c in unicodedata.normalize('NFD', unicode(s))
>                        if unicodedata.category(c) != 'Mn'))
>
> Normalize to NFD (decomposed), then strip all the Mark, Nonspacing
> (Mn) characters. Obviously you want to index both the accented and
> stripped versions.
>
> best, Erik
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to 
> [email protected]