#313: compound last name search differences
------------------------+---------------------------------------------------
Reporter: tbrooks | Owner:
Type: defect | Status: new
Priority: minor | Milestone:
Component: WebSearch | Version:
Resolution: | Keywords: INSPIRE Syntax
------------------------+---------------------------------------------------
Comment (by jblayloc):
This could be as simple as which way the metadata is written. If a name
is written forwards, then middle terms are assumed to be middle names. If
it's catalogued lastname-first, we trust the cataloguer to correctly
identify the last name.
Last names are never mangled; middle names are.
Refer to the name expansion code
(modules/bibindex/lib/bibindex_fuzzy_name_tokenizer.py:213-264) and this
example:
{{{
In [2]: fnt = bet.BibIndexFuzzyNameTokenizer()  # bet is bibindex_engine_tokenizer.py
In [3]: fnt.parse_scanned(fnt.scan('david foncesca mota'))
Out[3]:
['d f mota',
'd foncesca mota',
'd mota',
'david f mota',
'david foncesca mota',
'david mota',
'f mota',
'foncesca mota',
'mota',
'mota, d',
'mota, d f',
'mota, d foncesca',
'mota, david',
'mota, david f',
'mota, david foncesca',
'mota, f',
'mota, foncesca']
In [4]: fnt.parse_scanned(fnt.scan('foncesca mota, david'))
Out[4]:
['d foncesca',
'd foncesca mota',
'd mota',
'david foncesca',
'david foncesca mota',
'david mota',
'foncesca',
'foncesca mota',
'foncesca mota, d',
'foncesca mota, david',
'foncesca, d',
'foncesca, david',
'mota',
'mota, d',
'mota, david']
}}}
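For anyone who would rather not read the tokenizer, here is a minimal
sketch that reproduces the two lists above. It is not the bibindex code:
expand_name and _variants are made-up names, and it assumes only what the
example shows (given names may be spelled out, reduced to an initial, or
dropped; last names are kept whole). It has not been checked against names
of any other shape.

```python
from itertools import product


def _variants(given_name):
    """A given name may appear in full, as an initial, or not at all."""
    return [given_name, given_name[0], ""]


def expand_name(name):
    """Illustrative stand-in for the fuzzy name expansion (an assumption)."""
    if "," in name:
        # Catalogued lastname-first: trust everything before the comma.
        last_part, given_part = [p.strip() for p in name.split(",", 1)]
        last_tokens = last_part.split()
        # Each last name on its own, plus the full compound; never abbreviated.
        last_forms = set(last_tokens) | {" ".join(last_tokens)}
        given_tokens = given_part.split()
    else:
        # Written forwards: only the final term is taken as the last name.
        tokens = name.split()
        last_forms = {tokens[-1]}
        given_tokens = tokens[:-1]

    forms = set()
    for combo in product(*[_variants(t) for t in given_tokens]):
        given = " ".join(part for part in combo if part)
        for last in last_forms:
            forms.add((given + " " + last).strip())
            if given:
                forms.add(last + ", " + given)
    return sorted(forms)


forward = expand_name("david foncesca mota")
catalogued = expand_name("foncesca mota, david")
```

Both lists contain 'd mota', but only the forward one contains
'mota, d f', which is the mismatch this ticket is about.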
Regardless of which way it went on the paper, AFAIK searching for 'd mota'
will always find his papers (though of course it may not find only his
papers; for that, a disjunction of forms would probably do). I actually
view it as a data mistake, if foncesca is part of a compound last name,
that we've indexed him as david mota in the past.
I'm happy to take suggestions, but it's not obvious to me how we should
make guesses if the data is incomplete or wrong. I don't know that doing
the extra expansion in search_engine_query_parser is safe; it seems like
it would break other middle-name/last-name distinctions for people with
other preferences.
I suppose we could make compound last names also index with the first
letter of the first last name as if it were a middle name; this would be
more consistent with spires' behavior, but sort of nonsensical otherwise.
I also dislike it because I like to be able to say with certainty, "last
names are never mangled". And it still wouldn't help David, since we've
actually treated his name inconsistently in the (meta)data.
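To make that idea concrete, a rough sketch of the extra forms it would add
for a name catalogued lastname-first. This is only an illustration of the
proposal, not existing code; initial_as_middle_forms is a hypothetical
name, and it assumes a single given name after the comma.

```python
def initial_as_middle_forms(name):
    """Extra forms if the first last name were also indexed as a middle initial.

    Hypothetical helper; expects 'last1 last2, given' with one given name.
    """
    last_part, given = [p.strip() for p in name.split(",", 1)]
    surnames = last_part.split()
    if len(surnames) < 2:
        return []  # not a compound last name, nothing to add
    initial = surnames[0][0]
    final = " ".join(surnames[1:])
    forms = set()
    for g in (given, given[0]):
        forms.add("%s, %s %s" % (final, g, initial))
        forms.add("%s %s %s" % (g, initial, final))
    return sorted(forms)


extra = initial_as_middle_forms("foncesca mota, david")
```

For 'foncesca mota, david' this adds 'mota, d f' and 'mota, david f' (and
their forward spellings), which is exactly where the first last name gets
mangled.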
Apropos of nothing, 'find a mota, d f or a fonseca mota, d' finds all and
only his papers.
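If someone wants to build that kind of disjunction by hand for other
authors, it is just the known forms joined with 'or'. A small sketch, with
author_disjunction being a made-up helper and the query syntax copied from
the search above:

```python
def author_disjunction(forms):
    """Join author name forms into one 'find a ... or a ...' query string."""
    if not forms:
        raise ValueError("need at least one name form")
    return "find " + " or ".join("a " + form for form in forms)


query = author_disjunction(["mota, d f", "fonseca mota, d"])
```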
I actually do think that standardizing the metadata is the right thing, in
this case and probably in general. I suspect it's not so very broken for
most people with compound last names, and I can't think of a code fix that
doesn't break semantics for people for whom it's currently working
correctly.
Joe
Replying to [comment:1 tbrooks]:
> Note that in the below correspondence we see that SPIRES and INSPIRE
> behave differently on names like
>
> Foncesca Mota, David
>
> SPIRES appears not to care whether we put the 2nd name with the 3rd, or
> with the first (which surprised me!)
>
> INSPIRE does care.
>
> This may not be crucial, but it is important to many users.
> Standardizing the metadata is possible, but not ideal.
--
Ticket URL: <http://invenio-software.org/ticket/313#comment:2>
Invenio <http://invenio-software.org>