#313: compound last name search differences
------------------------+---------------------------------------------------
  Reporter:  tbrooks    |       Owner:                
      Type:  defect     |      Status:  new           
  Priority:  minor      |   Milestone:                
 Component:  WebSearch  |     Version:                
Resolution:             |    Keywords:  INSPIRE Syntax
------------------------+---------------------------------------------------

Comment (by jblayloc):

 The difference may come down to the order in which the name is written in
 the metadata.  If a name is written forwards (first name first), then
 middle terms are assumed to be middle names.  If it's catalogued
 lastname-first, we trust the cataloguer to have identified the last name
 correctly.

 Last names are never mangled; middle names are.

 Refer to the name expansion code
 (modules/bibindex/lib/bibindex_fuzzy_name_tokenizer.py:213-264) and this
 example:
 {{{
 In [2]: fnt = bet.BibIndexFuzzyNameTokenizer()  # bet is bibindex_engine_tokenizer.py

 In [3]: fnt.parse_scanned(fnt.scan('david foncesca mota'))
 Out[3]:
 ['d f mota',
  'd foncesca mota',
  'd mota',
  'david f mota',
  'david foncesca mota',
  'david mota',
  'f mota',
  'foncesca mota',
  'mota',
  'mota, d',
  'mota, d f',
  'mota, d foncesca',
  'mota, david',
  'mota, david f',
  'mota, david foncesca',
  'mota, f',
  'mota, foncesca']

 In [4]: fnt.parse_scanned(fnt.scan('foncesca mota, david'))
 Out[4]:
 ['d foncesca',
  'd foncesca mota',
  'd mota',
  'david foncesca',
  'david foncesca mota',
  'david mota',
  'foncesca',
  'foncesca mota',
  'foncesca mota, d',
  'foncesca mota, david',
  'foncesca, d',
  'foncesca, david',
  'mota',
  'mota, d',
  'mota, david']
 }}}
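
 The two outputs above are consistent with a simple rule, sketched here as a
 toy model.  This is not the tokenizer's actual code: expand_name and its
 arguments are hypothetical, and the handling of last names with more than
 two parts is an assumption inferred from these two examples only.

```python
from itertools import product


def expand_name(given, family):
    """Toy model of the expansion shown in the ticket (not the Invenio code).

    given  -- list of first/middle names, e.g. ['david', 'foncesca']
    family -- list of last-name parts, e.g. ['mota']
    """
    # Each given name may appear in full, as an initial, or be dropped.
    given_choices = [(name, name[0], None) for name in given]
    given_variants = set()
    for combo in product(*given_choices):
        given_variants.add(' '.join(part for part in combo if part))

    # Last-name parts are never abbreviated: each part alone, or all together.
    # (Assumption: for three or more parts the real code may do more.)
    family_variants = set(family) | {' '.join(family)}

    forms = set()
    for g in given_variants:
        for f in family_variants:
            if g:
                forms.add(g + ' ' + f)    # forwards, e.g. "david mota"
                forms.add(f + ', ' + g)   # lastname-first, e.g. "mota, david"
            else:
                forms.add(f)              # bare last name
    return sorted(forms)


forwards = expand_name(['david', 'foncesca'], ['mota'])
catalogued = expand_name(['david'], ['foncesca', 'mota'])
```

 Under this model, forwards has the same 17 forms as Out[3] and catalogued
 the same 15 forms as Out[4].  Note that 'mota, d f' appears only in the
 first list, which is the search difference this ticket is about.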

 Regardless of which way the name appeared on the paper, AFAIK searching
 for 'd mota' will always find his papers (though of course it may not find
 only his papers; for that, a disjunction of forms would probably do).  If
 foncesca is part of a compound last name, I actually view it as a data
 mistake that we've indexed him as david mota in the past.

 I'm happy to take suggestions, but it's not obvious to me how we should
 make guesses if the data is incomplete or wrong.  I don't know that doing
 the extra expansion in search_engine_query_parser is safe; it seems like
 it would break other middle-name/last-name distinctions for people with
 other preferences.

 I suppose we could make compound last names also index with the first
 letter of the first last name as if it were a middle name; this would be
 more consistent with SPIRES' behavior, but sort of nonsensical otherwise.
 I also dislike it because I like to be able to say with certainty, "last
 names are never mangled".  And it still wouldn't help David, since we've
 actually treated his name inconsistently in the (meta)data.
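
 To make that idea concrete, here is a rough sketch of the extra forms such
 a change would add for a name catalogued lastname-first.  The function name
 and signature are hypothetical, and nothing like this exists in the
 tokenizer today; it only illustrates the proposal.

```python
def compound_as_middle_forms(first, family):
    """Extra index forms if leading last-name parts were ALSO treated as
    middle initials (hypothetical; illustrates the proposal only).

    first  -- first name, e.g. 'david'
    family -- list of last-name parts, e.g. ['foncesca', 'mota']
    """
    if len(family) < 2:
        return []  # not a compound last name, nothing to add

    # Initials of every last-name part except the final one.
    initials = ' '.join(part[0] for part in family[:-1])
    last = family[-1]

    forms = []
    for f in (first, first[0]):
        forms.append('%s %s %s' % (f, initials, last))   # "david f mota"
        forms.append('%s, %s %s' % (last, f, initials))  # "mota, david f"
    return forms


extra = compound_as_middle_forms('david', ['foncesca', 'mota'])
```

 For 'foncesca mota, david' this yields 'david f mota', 'mota, david f',
 'd f mota' and 'mota, d f', i.e. forms in which part of the last name has
 been abbreviated.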

 Apropos of nothing, 'find a mota, d f or a fonseca mota, d' finds all and
 only his papers.
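
 A disjunction like that can be assembled mechanically from a list of name
 forms.  A minimal sketch, assuming the 'find a X or a Y' syntax used in the
 query above; author_disjunction is a hypothetical helper, not part of
 Invenio:

```python
def author_disjunction(forms):
    """Join author name forms into one 'find a ... or a ...' query string."""
    if not forms:
        raise ValueError("need at least one name form")
    # Prefix every form with the author keyword 'a' and join with 'or'.
    return 'find ' + ' or '.join('a ' + form for form in forms)


query = author_disjunction(['mota, d f', 'fonseca mota, d'])
```

 With the two forms above, query is the same string as the search quoted in
 this comment.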

 I actually do think that standardizing the metadata is the right thing, in
 this case and probably in general.  I suspect it's not so very broken for
 most people with compound last names, and I can't think of a code fix that
 doesn't break semantics for people for whom it's currently working
 correctly.

 Joe

 Replying to [comment:1 tbrooks]:
 > Note that in the below correspondence we see that SPIRES and INSPIRE
 > behave differently on names like
 >
 > Foncesca Mota, David
 >
 > SPIRES appears not to care whether we put the 2nd name with the 3rd, or
 > with the first (which surprised me!)
 >
 > INSPIRE does care.
 >
 > This may not be crucial, but it is important to many users.
 > Standardizing the metadata is possible, but not ideal.

-- 
Ticket URL: <http://invenio-software.org/ticket/313#comment:2>
Invenio <http://invenio-software.org>
