Hello Ferran,

>> we have changed the default word tokenizer to properly account for
>> Czech accents and now need to rebuild all the indexes. All went well
>> except that the global virtual index refuses to reindex.
>
> I'm unsure that the way to tackle this is in the word tokenizer;
> shouldn't it be done in the strip_accents function?  Some years ago I
> proposed to change its implementation:

Yes, we have indeed changed the strip_accents function, but the
result is that the tokenization has changed and the global virtual
index refuses to fully pick up the change.

>  https://github.com/inveniosoftware/invenio/issues/425

Our change to strip_accents was a bit more opportunistic. We just
added some more accented letters to the repertoire of regexps used
there and also added Unicode normalization as the initial step.
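
For the record, the normalization step is roughly along these lines.
This is only a simplified sketch, not our actual code, and it leaves
out the regexp replacements mentioned above:

    import unicodedata

    def strip_accents(text):
        # Sketch only: NFKD-decompose so accented letters become
        # base letter + combining mark, then drop the combining marks.
        # (The regexp replacements mentioned above would follow here.)
        decomposed = unicodedata.normalize('NFKD', text)
        return u''.join(ch for ch in decomposed
                        if not unicodedata.combining(ch))

    # strip_accents(u'Příliš žluťoučký kůň') == u'Prilis zlutoucky kun'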

> I did not propose a patch because I don't know how to implement the
> tests.

Me neither :(
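
That said, maybe a couple of simple assertions on known Czech strings
would already do as a start. Just a rough idea, and the import path
below is only my guess, not necessarily the real module:

    # Rough idea only; the import path is a guess on my part.
    from invenio.utils.text import strip_accents

    def test_strip_accents_czech():
        # Czech diacritics should be reduced to plain ASCII letters.
        assert strip_accents(u'Příliš žluťoučký kůň') == u'Prilis zlutoucky kun'
        assert strip_accents(u'žluťoučký') == u'zlutoucky'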

Regards,

Petr
