On Fri, 25 Oct 2013, [email protected] wrote:
> It seems that even in the demo site, BibIndex complains (but does not
> stop) in certain records that have authors with non-latin names.
Yes, it does, thanks for reporting the problem!
> sudo -u apache /opt/invenio/bin/bibindex -i 75 -w author -v9
>
> The warnings will NOT be written to the bibsched_task log file, but
> only shown on screen during execution.
Note that they are now logged in `invenio.err' as per ticket:1616.
> In case you cannot verify my findings, [...]
Yes, I can verify them. This is a bug in Invenio that causes some UTF-8
strings to be indexed incorrectly. The error is ``not too bad''
because searching still works, but the indexer ``overdoes'' things by
generating extra terms, and so it should be fixed.
Here is how one can easily reproduce it:
| In [70]: x = "Пушкин, А С"
|
| In [71]: from invenio.bibindex_tokenizers.BibIndexAuthorTokenizer import BibIndexAuthorTokenizer
|
| In [72]: t = BibIndexAuthorTokenizer()
|
| In [73]: l = t.tokenize_for_words('Пушкин, А С')
|
| In [74]: print "\n".join(l)
| с
| а
| пушкин
So far so good, but:
| In [75]: l = t.tokenize_for_phrases('Пушкин, А С')
|
| In [78]: print "\n".join(l)
| . . Пушкин
| . Пушкин
| . С Пушкин
| А . Пушкин
| А Пушкин
| А С Пушкин
| Пушкин, .
| Пушкин, . .
| Пушкин, . С
| Пушкин, А
| Пушкин, А .
| Пушкин, А С
| Пушкин, С
| С Пушкин
See the stray `\320' bytes (they show up as `.' in the listing above)
that should not really be there, and see the extra terms in the index
that one does not really need.
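
For illustration, here is a minimal sketch of how a lone `\320' can
arise. It assumes, without my having checked the tokenizer code, that an
initial is taken by slicing the first byte of the UTF-8 encoded name
rather than its first character. Both helper functions are hypothetical
and are not part of Invenio.

```python
# -*- coding: utf-8 -*-
# Illustration only: neither helper below exists in Invenio.


def first_byte_initial(name_utf8):
    """Hypothetical: take the 'initial' by slicing one byte off a UTF-8 string."""
    return name_utf8[:1]


def first_char_initial(name_utf8):
    """Hypothetical: decode first, slice one character, then re-encode."""
    return name_utf8.decode("utf-8")[:1].encode("utf-8")


# Cyrillic capital A (U+0410) occupies two bytes in UTF-8: 0xD0 0x90.
name = u"\u0410".encode("utf-8")

broken = first_byte_initial(name)  # only the lead byte, not valid UTF-8 by itself
fixed = first_char_initial(name)   # the complete two-byte character

print(repr(broken))
print(repr(fixed))
```

Octal 320 is hexadecimal D0, the lead byte shared by the UTF-8
encodings of both `А' and `С', which would explain why the same stray
byte appears in place of either initial.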
I've just ticketised this:
http://invenio-software.org/ticket/1626
Best regards
--
Tibor Simko