Re: RFC unifying phrase search behaviour

Tibor Simko Wed, 26 Feb 2014 03:06:18 -0800

On Wed, 26 Feb 2014, Alexander Wagner wrote:
> If I get it
>
>       245:"Sonata Kreutzer"
>
> would not match. Right? (Any word in field kind of thing.)


Right.

> What defines "end of word"? I think about this ID-thingy: how many
> words are things like "P:(DE-Juel1)12345".

The definition of the "end of word" depends on the tokeniser that the
concrete index uses.

For an index consisting of IDs like this, you would use an "exact
tokeniser", which would generate only one term, so there would be no word
splitting happening at all.  We do this already for several indexes,
e.g. compare "title" and "exacttitle" indexes:

   http://inspirehep.net/search?p=title%3A%22nuclear+electronics%22
   http://inspirehep.net/search?p=exacttitle%3A%22nuclear+electronics%22
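
To make the difference concrete, here is a minimal sketch of the two
tokeniser styles.  The function names and the splitting rule are
assumptions for illustration only; they are not Invenio's actual
tokeniser classes.

```python
import re


def word_tokenise(value):
    """Hypothetical word tokeniser: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", value.lower())


def exact_tokenise(value):
    """Hypothetical exact tokeniser: the whole value becomes a single term."""
    return [value.strip()]


identifier = "P:(DE-Juel1)12345"
print(word_tokenise(identifier))   # several terms, one per run of word characters
print(exact_tokenise(identifier))  # exactly one term, no word splitting
```

With a splitting rule like this one, the ID falls apart into several
words, while the exact variant keeps it as one term, so a phrase query
can only match the ID as a whole.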

If you wanted to search for "P:(DE-Juel1)12345" in the MARC style,
say via a query like 100__0:"P:(DE-Juel1)12345", then the situation is
different, because for low-level direct MARC searching there are no word
pairs in the picture, and a direct lookup in bibxxx tables is happening
behind the scenes.  In this case the word boundaries would be best
defined by regexps, say "[[:<:]]stuff-people-typed-goes-here[[:>:]]".
So it would be a kind of combination of exact phrase match (since there
is no stemming etc) and partial phrase match (since we'd allow for
preceding or subsequent words).
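
A rough sketch of that boundary-delimited matching, written in Python
rather than as the SQL regexp.  Python's re module has no [[:<:]] and
[[:>:]] markers, so lookarounds on word characters stand in for them
here; the helper name and the case-insensitive matching are assumptions
of this sketch, not the actual implementation.

```python
import re


def marc_phrase_match(phrase, value):
    """Hypothetical helper: is `phrase` present in `value` on word boundaries?

    The phrase is taken literally (no stemming, no wildcards), but words
    may precede or follow it in the stored value.
    """
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, value, flags=re.IGNORECASE) is not None


title = "Kreutzer Sonata for violin and piano"  # made-up field value
print(marc_phrase_match("Kreutzer Sonata", title))  # whole words, in order
print(marc_phrase_match("reutzer son", title))      # partial words
print(marc_phrase_match("(DE-HGF)1", "(DE-HGF)11")) # ID prefix only
```

The first call matches and the other two do not, because in both of
them the typed text starts or ends in the middle of a word.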

> IMHO "*reutzer son*" would be an easier to remember syntax for mere
> mortals. Does this work as well?

Almost, the difference being that "*" is not hungry enough to eat white
space.  For example, see demo record 32 that contains:

  245__ $$aBasic nuclear electronics

and try the following queries on <http://invenio-demo.cern.ch/>:

  title:"nuclear electronics" ... hit
  title:"nucl* elec*" ... hit
  title:"basic nuclear electronics" ... hit
  title:"basic electronics" ... miss
  title:"bas* electronics" ... miss
  245:"bas* electronics" ... hit

It is the last query's results that the current RFC proposes to
change.

IOW, title:"foo* bar*" means a two-word combination where the first word
starts with "foo" and the second word starts with "bar".  In MARC style,
by contrast, 245:"foo* bar*" currently means exact values that start
with "foo", continue with any number of characters (white space
included), and then continue with "bar".
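
The two readings of "*" can be sketched as follows.  This is only an
illustration under simplifying assumptions (white-space word splitting,
case-insensitive comparison, hypothetical function names); it is not
the code that the search engine runs.

```python
import re


def _wildcard_to_regex(pattern, star):
    """Escape the literal parts of `pattern` and join them with `star`."""
    return star.join(re.escape(part) for part in pattern.split("*"))


def index_style_match(pattern, value):
    """Word-pair reading: "*" stays inside one word, words must be adjacent."""
    words = value.lower().split()
    wanted = [re.compile(_wildcard_to_regex(word, r"\S*"))
              for word in pattern.lower().split()]
    n = len(wanted)
    # Slide a window of n consecutive words over the value, keeping word order.
    return any(
        all(rx.fullmatch(word) for rx, word in zip(wanted, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


def marc_style_match_current(pattern, value):
    """Current MARC-style reading: whole value, "*" also eats white space."""
    rx = re.compile(_wildcard_to_regex(pattern.lower(), r".*"))
    return rx.fullmatch(value.lower()) is not None


value = "Basic nuclear electronics"  # 245__ $$a of demo record 32
for query in ("nuclear electronics", "nucl* elec*", "basic electronics",
              "bas* electronics"):
    print(query, index_style_match(query, value))
print("bas* electronics", marc_style_match_current("bas* electronics", value))
```

Under these assumptions the sketch gives the same hits and misses as
the demo queries listed above, including the miss for
title:"bas* electronics" and the hit for 245:"bas* electronics".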

>>    +-----------------------+-------------------+--------------------+
>>    | QUERY                 | CURRENT BEHAVIOUR | PROPOSED BEHAVIOUR |
>>    +-----------------------+-------------------+--------------------+
>>    | 245:'Kreutzer Sonata' | hit               | hit                |
>>    | 245:"Kreutzer Sonata" | miss              | hit                |
>
> I'm not sure about the hit here in the new version.

This is what title:"Kreutzer Sonata" returns, and this is what people
are used to seeing on Google and friends.  We simply plan to generalise
this behaviour to all indexes and to all MARC-style queries as well.

>>    | 245:'reutzer son'     | hit               | miss               |
>>    | 245:"reutzer son"     | miss              | miss               |
>>    | 245:/reutzer son/     | hit               | hit                |
>>    +-----------------------+-------------------+--------------------+
>>
>> Note that proposed behaviour is already the case for some logical
>> indexes such as "title" in Invenio v1.1 release series and above.
>
> I found that Invenio is doing fancy stuff in certain fields (author
> seems to be very special...)

Yes, the "author" index uses a special fuzzy tokeniser, so that for an
author named "Ellis, Jonathan Richard" people can type "John Ellis" and
still get a hit, not a miss.  For librarian-style queries though, there
is an "exactauthor" index that behaves more strictly here.

> Still, but this is a feeling, I'm not sure that giving up "exact
> match" type searches is a good idea.

In my eyes, it is not giving it up, it is more (i) advocating the use of
proper tokenisers on various indexes: sometimes exact, sometimes fuzzy,
etc; as well as (ii) harmonising behaviour between logical queries using
index names and physical queries using MARC tags.

>>> if you map "sid:(DE-HGF)1" to the old 'sid:(DE-HGF)1' it matches also
>>> "sid:(DE-HGF)11", which is wrong and not intended.
>>
>> Nope, it would not be mapped that way, see above.  The ID matching would
>> remain safe.
>
> So word ends are white spaces? Or is it that "" does not use
> permutations?

Yes, word boundaries are essentially white spaces, at least for
MARC-style queries.  (For regular indexes, the behaviour can be
configured for every index differently, depending on the tokeniser
used.)

Yes, the word order is respected when matching; permutations would
not be considered a phrase match.

Best regards
-- 
Tibor Simko
