On Tue, 06 May 2014, Petr Brož wrote:
>> https://github.com/inveniosoftware/invenio/issues/425
>
> Our change to strip_accents was a bit more opportunistic. We have just
> added some more accented letters to the repertoire of regexps used
> there and also added Unicode normalization as the initial step there.
>
>> I did not propose a patch because I don't know how to implement the
>> tests.
>
> Me either :(
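
As a rough illustration of the "Unicode normalization as the initial step" idea quoted above, here is a minimal Python 3 sketch. It is not the Invenio code and not necessarily what the change quoted above does; the helper name strip_accents_sketch is hypothetical, and it uses only the standard unicodedata module:

```python
import unicodedata


def strip_accents_sketch(text):
    """Hypothetical helper, not invenio.textutils.strip_accents.

    Decompose each character (NFKD), then drop the combining marks
    that the decomposition exposes.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents_sketch("Všichni lidé se rodí svobodní a sobě rovní"))
```

Note that normalization alone only helps for letters that decompose into a base letter plus a combining mark. A letter such as "Ł" has no such decomposition and passes through unchanged, so a real implementation would presumably still need extra mappings or regexps on top.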

On this accent stripping topic, I have an almost finished branch that
should take care of ASCII'fication of Czech and many other languages
properly out of the box. The only exceptions may be the CJK family of
languages and Greek, for which opinions differ:

   https://github.com/inveniosoftware/invenio/issues/1675

Here is an example:

   In [1]: x = "Všichni lidé se rodí svobodní a sobě rovní " \
               "co do důstojnosti a práv."

   In [2]: from invenio.textutils import strip_accents

   In [3]: strip_accents(x)
   'Vsichni lide se rodi svobodni a sobe rovni co do dustojnosti a prav.'

Best regards
-- 
Tibor Simko
