#219: BibIndex: Should first names be indexed even in fuzzy?
-----------------------+----------------------------------------------------
  Reporter:  jblayloc  |       Owner:  simko              
      Type:  defect    |      Status:  assigned           
  Priority:  major     |   Milestone:                     
 Component:  BibIndex  |     Version:                     
Resolution:            |    Keywords:  Invenio Syntax NEWS
-----------------------+----------------------------------------------------

Comment (by tbrooks):

 There was some discussion among the INSPIRE directors, and I think the
 consensus was as follows (quoting/paraphrasing from Tibor's mail):

 find-author-jane and author:jane behaving the same way

 This seems the best option. However, it leaves one hole: someone using
 Google syntax (i.e. no keywords in the search) might type "Jane" and hope
 to get all of Jane Austen's works, as one would in Google. That will not
 work here unless we build a separate first-name index solely for any-field
 use (we could do this later).

 For now we go for "find author Jane" and "author:Jane" behaving the same.
 If we go for this technique, then the proper solution would be:

  * We should kill the word index for authors, and use only one type of
    index for any word/phrase/regexp queries.  Actually, I have already
    done that for the journal index, so the author index would not be
    alone in this respect.  (Which adds some weight to the pro camp.)
    We just have to make sure to nicely document what we do with every
    index, and what any-field index actually means.

  * We should alter the fuzzy author name tokenizer to generate not only
    the various first name/last name combinations, but also the plain
    last name on its own.  For example, we currently do (in the master
    branch):

     In [6]: t.tokenize('Peskin, Michael')
     Out[6]: ['M Peskin', 'Michael Peskin', 'Peskin, M', 'Peskin, Michael']

    We should add the family name to make it findable:

     In [6]: t.tokenize('Peskin, Michael')
     Out[6]: ['Peskin', 'M Peskin', 'Michael Peskin', 'Peskin, M', 'Peskin, Michael']

  * We should make sure that we handle all the interesting cases well,
    such as Asian names (Su Yong Hong vs Hong Su Yong) and composite
    names (Le Meur, Jean-Yves).
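
 To make the second point concrete, here is a minimal sketch of a
 tokenizer that also emits the bare family name.  It is not the actual
 BibIndex tokenizer: the function name tokenize_author is hypothetical,
 and it assumes the input is already in "Family, Given" form.

```python
def tokenize_author(name):
    """Return search variants for a name given as 'Family, Given'.

    Hypothetical sketch, not the BibIndex implementation.
    """
    # Split on the first comma only, so composite family names
    # such as 'Le Meur' stay in one piece.
    family, _, given = name.partition(',')
    family = family.strip()
    given_parts = given.split()
    if not given_parts:
        # No given name at all: only the family name can be indexed.
        return [family] if family else []

    given_full = ' '.join(given_parts)
    initials = ' '.join(part[0] for part in given_parts)

    variants = [
        family,                      # the proposed addition: bare family name
        f'{initials} {family}',
        f'{given_full} {family}',
        f'{family}, {initials}',
        f'{family}, {given_full}',
    ]

    # Drop duplicates (e.g. when the given name is already an initial)
    # while preserving order.
    seen = set()
    result = []
    for variant in variants:
        if variant not in seen:
            seen.add(variant)
            result.append(variant)
    return result


print(tokenize_author('Peskin, Michael'))
print(tokenize_author('Le Meur, Jean-Yves'))
```

 This sketch does nothing about name ordering for Asian names (Su Yong
 Hong vs Hong Su Yong); that case would still need separate treatment.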

 Implementation-wise, Joe could take care of the latter two points, and
 Tibor could take care of the former point, as we did when plugging in the
 author fuzzy tokenizer.

-- 
Ticket URL: <http://invenio-software.org/ticket/219#comment:7>
Invenio <http://invenio-software.org>
