Hello Tibor,

I'm afraid things are not so easy; it seems there is a conflict with
the Python stemmer.

> > For your use case, it would seem sufficient to add the new
> > characters as punctuation. You can experiment with
> > CFG_BIBINDEX_CHARS_PUNCTUATION and
> > CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS settings in your
> > instance and then use the above Python code to see the effects
> > without having to run full reindexing.
> 
> Thank you for this clear explanation.  Indeed, adding \«\»\¡\¿ to
> CFG_BIBINDEX_CHARS_PUNCTUATION seems to be enough:

After much fighting with Unicode errors and narrowing the problem down
to the smallest fragment, the issue seems to be with stemming.
However, I still don't get it.

On a standard CFG_BIBINDEX_CHARS_* system:

~/invenio$ python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from invenio.bibindex_engine import *
>>> get_words_from_phrase('The man said: «¡Go, run!»', stemming_language='ca')
['said', 'run', 'run!\xc2\xbb', '\xc2\xab\xc2\xa1go', 'the', '\xc2\xbb', 'man']

On a system with a single addition (\«) to both
CFG_BIBINDEX_CHARS_PUNCTUATION and
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, everything is OK without a
stemming language, but as soon as I activate stemming (any language),
I get the deadly UnicodeDecodeError, for example:

~/invenio$ python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from invenio.bibindex_engine import *
>>> get_words_from_phrase('The man said: «¡Go, run!»')
['\xa1go', 'said', 'run', 'run!\xc2\xbb', 'the', '\xbb', 'man']
>>> get_words_from_phrase('The man said: «¡Go, run!»', stemming_language='en')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ddd/lib/python/invenio/bibindex_engine.py", line 509, in get_words_from_phrase
    return words_tokenizer.tokenize(phrase)
  File "/home/ddd/lib/python/invenio/bibindex_engine_tokenizer.py", line 193, in tokenize
    stemmed_block = apply_stemming_and_stopwords_and_length_check(block, self.stemming_language)
  File "/home/ddd/lib/python/invenio/bibindex_engine_washer.py", line 101, in apply_stemming_and_stopwords_and_length_check
    word = stem(word, stemming_language)
  File "/home/ddd/lib/python/invenio/bibindex_engine_stemmer.py", line 65, in stem
    return _stemmers[get_ident()][lang].stemWord(word)
  File "Stemmer.pyx", line 192, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1988)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 0: invalid start byte
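For what it's worth, I think I can reproduce the mechanism outside
Invenio.  This is only my own minimal sketch (the regex below is my
reconstruction, not the exact pattern bibindex builds): when « is put
into a punctuation character class as raw UTF-8 bytes, the class
contains the bytes \xc2 and \xab, which match *individually*, so the
lead byte \xc2 of ¡ (\xc2\xa1) is stripped and an orphan continuation
byte 0xa1 is left at the start of the token -- exactly the byte the
stemmer then fails to decode:

```python
import re

# Hypothetical byte-level punctuation class (my reconstruction, not the
# regex bibindex actually builds): « as raw UTF-8 adds the bytes \xc2
# and \xab to the class, and each byte matches on its own.
punct = re.compile(b'[\xc2\xab\xbb!,:]')

phrase = u'The man said: \u00ab\u00a1Go, run!\u00bb'.encode('utf-8')
words = punct.sub(b' ', phrase).lower().split()
print(words)  # note the orphan byte: [..., b'\xa1go', ...]

# That orphan byte is exactly what stemWord() chokes on:
try:
    b'\xa1go'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)  # can't decode byte 0xa1 in position 0: invalid start byte
```

So the washer hands the stemmer a byte string that is no longer valid
UTF-8, which would explain why no amount of re-encoding afterwards
helps.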

Googling around, I've found other users with this problem.  Myself,
I've tried several combinations of string encoding, decoding and
related Python functions, without any success.  I don't know whether
it matters that both CFG_BIBINDEX_CHARS_* settings are defined as raw
strings in inveniocfg.py.

The Python stemmer doesn't have any upstream release newer than
Debian's 1.3.

I can imagine a simple solution: converting those characters to their
simpler 8-bit equivalents (for example «»“”„“ -> ", ¡ -> !, etc., or
just a plain space character) somewhere after reading the string (in
get_words_from_phrase() and get_words_from_fulltext()), and then
letting the word splitting work as usual.
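Something along these lines -- just an illustrative sketch, the helper
name fold_punctuation() and the replacement table are mine, this is
not existing Invenio code:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of Invenio: fold the problematic
# punctuation to plain 8-bit equivalents before word splitting runs.
REPLACEMENTS = {
    u'\u00ab': u'"',   # «
    u'\u00bb': u'"',   # »
    u'\u201c': u'"',   # “
    u'\u201d': u'"',   # ”
    u'\u201e': u'"',   # „
    u'\u00a1': u'!',   # ¡
    u'\u00bf': u'?',   # ¿
}
_TABLE = dict((ord(k), v) for k, v in REPLACEMENTS.items())

def fold_punctuation(phrase):
    """Return phrase with the fancy punctuation mapped to ASCII."""
    if isinstance(phrase, bytes):        # decode first, so a multi-byte
        phrase = phrase.decode('utf-8')  # character is never split in half
    return phrase.translate(_TABLE)

print(fold_punctuation(u'The man said: \u00ab\u00a1Go, run!\u00bb'))
# -> The man said: "!Go, run!"
```

The important part, I think, is decoding to Unicode before any
replacement or splitting, so no byte sequence ever gets cut in half.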

Do you envisage a better solution?

Thanks,

Ferran
