On 26.02.2014 12:05, Tibor Simko wrote:

Hi!

[...]
What defines "end of word"? I am thinking of this ID thingy:
how many words is something like "P:(DE-Juel1)12345"?

The definition of the "end of word" depends on the
tokeniser that the concrete index uses.

I think here's the missing link I need to learn about.

For an index consisting of IDs like this, you would use
"exact tokeniser" which would generate only one term, so
there would be no word splitting happening at all.
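
To make the difference concrete, here is a minimal sketch, not
Invenio's actual code. Both function names are hypothetical, and the
splitting rule of the word tokeniser is an assumption made only for
illustration.

```python
import re


def exact_tokenise(value):
    """Hypothetical 'exact' tokeniser: the whole value is a single term."""
    return [value.strip()]


def word_tokenise(value):
    """Hypothetical word tokeniser (assumed rule): split on every
    character that is not an ASCII letter or digit."""
    return [term for term in re.split(r"[^0-9A-Za-z]+", value) if term]


value = "P:(DE-Juel1)12345"
print(exact_tokenise(value))  # ['P:(DE-Juel1)12345']
print(word_tokenise(value))   # ['P', 'DE', 'Juel1', '12345']
```

With the exact variant the ID stays one term, so there is nothing for
a word-based search to split or recombine.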

I think this would solve the issue, indeed. I was not aware
that I can hook up a specific tokenizer to an index. I see
in our 1.0 that there's some magic happening with authors,
but it always looked a bit hard-coded "just for authors".

If you would want to search for "P:(DE-Juel1)12345" in the
MARC style, say via a query like
100__0:"P:(DE-Juel1)12345", then the situation is
different, because for low-level direct MARC searching
there are no word pairs in the picture, and a direct
lookup in bibxxx tables is happening behind the scenes.

So it would always be an exact match type query, right?

While if I use aid as a logical field I could (somehow) add
a tokenizer to its index that tells the indexer: this has
to be taken literally.

I think I need a lesson on that.

[...]
IOW, title:"foo* bar*" means a two-word combination where
the first word starts with "foo" and the second word
starts with "bar". In MARC style, by contrast,
title:"foo* bar*" currently means exact values that start
with "foo", continue with any number of characters (white
space included), and continue with "bar".
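
The two readings can be sketched as follows. This is an illustration
under assumptions, not the search engine's implementation: both
function names are hypothetical, words are assumed to be separated by
white space, and matching is case-sensitive.

```python
import fnmatch


def phrase_wildcard_match(pattern, value):
    """Word-by-word reading: some run of consecutive words in the value
    must match the pattern words one by one."""
    pats = pattern.split()
    words = value.split()
    for i in range(len(words) - len(pats) + 1):
        if all(fnmatch.fnmatchcase(w, p) for w, p in zip(words[i:], pats)):
            return True
    return False


def whole_value_match(pattern, value):
    """Whole-value reading: '*' matches any run of characters,
    white space included."""
    return fnmatch.fnmatchcase(value, pattern)


value = "football is played in bars"
print(phrase_wildcard_match("foo* bar*", value))  # False
print(whole_value_match("foo* bar*", value))      # True
```

For a value such as "foolish barons" both readings give a hit; they
only differ once other words sit between the two matches.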

    +-----------------------+-------------------+--------------------+
    | QUERY                 | CURRENT BEHAVIOUR | PROPOSED BEHAVIOUR |
    +-----------------------+-------------------+--------------------+
    | 245:'Kreutzer Sonata' | hit               | hit                |
    | 245:"Kreutzer Sonata" | miss              | hit                |
    +-----------------------+-------------------+--------------------+

I'm not sure about the hit here in the new version.

This is what title:"Kreutzer Sonata" returns, and this is what people
are used to seeing on Google and friends.  We simply plan to generalise
this behaviour to all indexes and to all MARC-style queries as well.

I think there's a misunderstanding. I had this case in mind
with my funny ID searches. And matching 1 and 11 and 111 in
one go is not desirable there ;)

[...]
I found that Invenio is doing fancy stuff in certain
fields (author seems to be very special...)

Yes, the "author" index uses a special fuzzy tokeniser, so
that for an author named "Ellis, Jonathan Richard" people
can type "John Ellis" and still get a hit, not a miss.

I always wondered where I could see that this FNT is
hooked up to this index.

For librarian style queries though, there is an
"exactauthor" index that behaves stricter here.

I see. This would, however, then require an explicit
"exact" index for every field that should support exact
searches. At least if one needs a combined index for them,
as the data could come in from more than one field.

Still, and this is just a feeling, I'm not sure that giving
up "exact match" type searches is a good idea.

In my eyes, it is not giving it up, it is more (i)
advocating the use of proper tokenisers on various indexes

I see your point now. As said: I missed that I can add
different tokenizers for various fields/indices. Where
should I find that? Or shouldn't I in 1.0?

if you map "sid:(DE-HGF)1" to the old 'sid:(DE-HGF)1' it matches also
"sid:(DE-HGF)11", which is wrong and not intended.

Nope, it would not be mapped that way, see above.  The ID matching would
remain safe.

So word ends are white spaces? Or is it that "" does not use
permutations?

Yes, word boundaries are essentially white spaces, at least for
MARC-style queries.  (For regular indexes, the behaviour can be
configured for every index differently, depending on the tokeniser
used.)

Now I understood that.

Yes, the word order is respected when matching, the
permutations would not be considered a phrase match.
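
A minimal sketch of that rule, assuming white space is the only word
boundary; phrase_match is a hypothetical helper, not a function from
the code base.

```python
def phrase_match(phrase, value):
    """True if the words of the phrase occur in the value as a
    consecutive run, in the same order (words split on white space)."""
    wanted = phrase.split()
    words = value.split()
    return any(
        words[i:i + len(wanted)] == wanted
        for i in range(len(words) - len(wanted) + 1)
    )


print(phrase_match("Kreutzer Sonata", "The Kreutzer Sonata"))  # True
print(phrase_match("Sonata Kreutzer", "The Kreutzer Sonata"))  # False
print(phrase_match("(DE-HGF)1", "(DE-HGF)11"))                 # False
print("(DE-HGF)1" in "(DE-HGF)11")                             # True
```

The last line shows the plain substring test, which is the reading
that would wrongly match "(DE-HGF)11"; the word-based comparison does
not have that problem.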

Agreed. I was just wondering if you want to add something
like "search those words in this field", and I'd not map
this to "" aka phrase search. Though it can be helpful for
general indices. "Topic" in Web of Science is like that: it
combines title + abstract + keywords, and adding words to a
topic search just matches any of them.

--

Kind regards,

Alexander Wagner
Scientific Services / Scientific Publishing
Central Library
52425 Juelich

mail : [email protected]
phone: +49 2461 61-1586
Fax  : +49 2461 61-6103
http://www.fz-juelich.de/zb/wp


