On Wed, 26 Feb 2014, Alexander Wagner wrote:
> If I get it
>
>   245:"Sonata Kreutzer"
>
> would not match. Right? (Any word in field kind of thing.)

Right.

> What defines "end of word"? I think about this ID-thingy: how many
> words are things like "P:(DE-Juel1)12345".

The definition of the "end of word" depends on the tokeniser that the
concrete index uses. For an index consisting of IDs like this, you
would use the "exact tokeniser", which generates only one term, so no
word splitting happens at all. We do this already for several
indexes, e.g. compare the "title" and "exacttitle" indexes:

  http://inspirehep.net/search?p=title%3A%22nuclear+electronics%22
  http://inspirehep.net/search?p=exacttitle%3A%22nuclear+electronics%22

If you wanted to search for "P:(DE-Juel1)12345" in the MARC style, say
via a query like 100__0:"P:(DE-Juel1)12345", then the situation is
different, because for low-level direct MARC searching there are no
word pairs in the picture, and a direct lookup in the bibxxx tables
happens behind the scenes. In this case the word boundaries would be
best defined by regexps, say
"[[:<:]]stuff-people-typed-goes-here[[:>:]]". So it would be a kind of
combination of exact phrase match (since there is no stemming etc) and
partial phrase match (since we'd allow for preceding or subsequent
words).

> IMHO "*reutzer son*" would be an easier to remember syntax for mere
> mortals. Does this work as well?

Almost, the difference being that "*" is not hungry enough to eat
white space. For example, see demo record 32 that contains:

  245__ $$aBasic nuclear electronics

and try the following queries on <http://invenio-demo.cern.ch/>:

  title:"nuclear electronics"       ... hit
  title:"nucl* elec*"               ... hit
  title:"basic nuclear electronics" ... hit
  title:"basic electronics"         ... miss
  title:"bas* electronics"          ... miss
  245:"bas* electronics"            ... hit

It is the result of the last query that the current RFC proposes to
change.

IOW, title:"foo* bar*" means a two-word combination where the first
word starts with "foo" and the second word starts with "bar".
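
As a rough illustration of this index-style reading only, here is a
minimal Python sketch. It assumes a tokeniser that merely lowercases
and splits on white space (real index tokenisers may do more, e.g.
stemming), and the function names are made up for illustration; they
are not Invenio APIs:

```python
def tokenise(text):
    """Lowercase and split on white space (assumed stand-in tokeniser)."""
    return text.lower().split()


def term_matches(pattern, word):
    """A trailing '*' matches the rest of one word, never white space."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def phrase_hit(query, value):
    """True if the query terms match consecutive words of value, in order."""
    terms, words = tokenise(query), tokenise(value)
    if not terms:
        return False
    n = len(terms)
    return any(
        all(term_matches(t, w) for t, w in zip(terms, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


if __name__ == "__main__":
    title = "Basic nuclear electronics"  # demo record 32, field 245__ $$a
    for q in ["nuclear electronics", "nucl* elec*",
              "basic nuclear electronics", "basic electronics",
              "bas* electronics"]:
        print('title:"%s" ... %s' % (q, "hit" if phrase_hit(q, title) else "miss"))
```

Under these assumptions the sketch gives the same hit/miss pattern as
the five title: queries listed above.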
In MARC style, by contrast, 245:"foo* bar*" currently means an exact
value that starts with "foo", continues with any number of characters
(white space included), and then continues with "bar".

>> +-----------------------+-------------------+--------------------+
>> | QUERY                 | CURRENT BEHAVIOUR | PROPOSED BEHAVIOUR |
>> +-----------------------+-------------------+--------------------+
>> | 245:'Kreutzer Sonata' | hit               | hit                |
>> | 245:"Kreutzer Sonata" | miss              | hit                |
>
> I'm not sure about the hit here in the new version.

This is what title:"Kreutzer Sonata" returns, and this is what people
are used to seeing on Google and friends. We simply plan to generalise
this behaviour to all indexes and to all MARC-style queries as well.

>> | 245:'reutzer son'     | hit               | miss               |
>> | 245:"reutzer son"     | miss              | miss               |
>> | 245:/reutzer son/     | hit               | hit                |
>> +-----------------------+-------------------+--------------------+
>>
>> Note that proposed behaviour is already the case for some logical
>> indexes such as "title" in Invenio v1.1 release series and above.
>
> I found that Invenio is doing fancy stuff in certain fields (author
> seems to be very special...)

Yes, the "author" index uses a special fuzzy tokeniser, so that for an
author named "Ellis, Jonathan Richard" people can type "John Ellis"
and still get a hit, not a miss. For librarian-style queries, though,
there is an "exactauthor" index that behaves more strictly here.

> Still, but this is a feeling, I'm not sure that giving up "exact
> match" type searches is a good idea.

In my eyes, it is not giving it up, it is more (i) advocating the use
of proper tokenisers on various indexes: sometimes exact, sometimes
fuzzy, etc; as well as (ii) harmonising behaviour between logical
queries using index names and physical queries using MARC tags.

>>> if you map "sid:(DE-HGF)1" to the old 'sid:(DE-HGF)1' it matches also
>>> "sid:(DE-HGF)11", which is wrong and not intended.
>>
>> Nope, it would not be mapped that way, see above.
>> The ID matching would remain safe.
>
> So word ends are white spaces? Or is it that "" does not use
> permutations?

Yes, word boundaries are essentially white spaces, at least for
MARC-style queries. (For regular indexes, the behaviour can be
configured differently for every index, depending on the tokeniser
used.)

Yes, the word order is respected when matching; permutations would
not be considered a phrase match.

Best regards
--
Tibor Simko

