#219: BibIndex: Should first names be indexed even in fuzzy?
-----------------------+----------------------------------------------------
Reporter: jblayloc | Owner: simko
Type: defect | Status: assigned
Priority: major | Milestone:
Component: BibIndex | Version:
Resolution: | Keywords: Invenio Syntax NEWS
-----------------------+----------------------------------------------------
Comment (by tbrooks):
There was some discussion among the INSPIRE directors, and I think the
consensus was as follows (quoting/paraphrasing from Tibor's mail):
find-author-jane and author:jane behaving the same way
This seems the best option; however, it leaves one hole: someone using
Google syntax (i.e. no keywords in the search) might type "Jane" and hope to
get all of Jane Austen's works, as one would in Google. This
will not work here unless we build a separate first-name index solely for
any-field searches (we could do this later).
For now we go for find author Jane and author:Jane behaving the same.
If we go for this technique, then the proper solution would be:
* We should kill the word index for authors, and use only one type of
index for any word/phrase/regexp queries. Actually, I have already
done that for the journal index, so the author index would not be
alone in this respect. (Which adds some weight to the pro camp.)
We just have to make sure to nicely document what we do with every
index, and what any-field index actually means.
* We should alter the fuzzy author name tokenizer to generate not only
various firstname/lastname combinations, but also plain
lastname words. For example, we currently do (in the master
branch):
In [6]: t.tokenize('Peskin, Michael')
Out[6]: ['M Peskin', 'Michael Peskin', 'Peskin, M', 'Peskin, Michael']
We should add the family name to make it findable:
In [6]: t.tokenize('Peskin, Michael')
Out[6]: ['Peskin', 'M Peskin', 'Michael Peskin', 'Peskin, M', 'Peskin, Michael']
* We should make sure that we handle all the interesting
cases well, such as Asian names (Su Yong Hong vs Hong Su Yong) and
composite names (Le Meur, Jean-Yves).
Implementation-wise, Joe could take care of the latter two points, and
Tibor could take care of the former point, as we did when plugging in the
author fuzzy tokenizer.
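To make the tokenizer change concrete, here is a rough sketch of the proposed
output for names given in "Family, Given" form. The function name
tokenize_author is hypothetical and this is not the actual BibIndex fuzzy
tokenizer; it only reproduces the variants listed above plus the plain family
name. It assumes a single comma separates family and given names, and it makes
no attempt to resolve names written without a comma (e.g. Hong Su Yong), which
would still need separate treatment.

```python
def tokenize_author(name):
    """Hypothetical sketch, not Invenio's real fuzzy author tokenizer.

    Assumes the input is written as 'Family, Given'.
    """
    family, sep, given = name.partition(',')
    family = family.strip()
    given = given.strip()

    if not sep or not given:
        # No comma (or nothing after it): we cannot tell which part is the
        # family name, so index the whole string unchanged.
        return [family] if family else []

    initial = given[0]
    tokens = [
        family,                           # new: plain family name
        '%s %s' % (initial, family),      # 'M Peskin'
        '%s %s' % (given, family),        # 'Michael Peskin'
        '%s, %s' % (family, initial),     # 'Peskin, M'
        '%s, %s' % (family, given),       # 'Peskin, Michael'
    ]
    # Drop duplicates (e.g. when the given name is already a single
    # initial) while keeping the order above.
    return list(dict.fromkeys(tokens))


print(tokenize_author('Peskin, Michael'))
print(tokenize_author('Le Meur, Jean-Yves'))
```

With this sketch a composite family name such as Le Meur is kept as one
token rather than split into words; whether that is what we want for the
author index is one of the open questions above.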
--
Ticket URL: <http://invenio-software.org/ticket/219#comment:7>
Invenio <http://invenio-software.org>