On 24.02.2014 11:30, Tibor Simko wrote:

Hi!

People don't easily distinguish between the following queries:

    title:'some phrase'

substring

    title:"some phrase"

exact search

[...]
    245:'some phrase'
    245:"some phrase"

so that single-quoted and double-quoted phrase queries would always
return the same result.

Which would then be an exact match, right? So to get the old ''
(substring) matches one would use "*bla*", right?

What this change means for you:

1. The end users can use single-quoted or double-quoted queries to
    express phrase search, in all indexes.  There would be no difference.

2. The phrase search would be done by default via word pair matching,
    unless indexes are tokenised in a special manner (e.g. exact author
    name) or unless users search inside physical MARC tags (when no word
    pair index exists).

3. If you have relied on "partial phrase matching", please switch to
    regular expression queries like:

       245:/some phrase/
       245:/[[:blank:]]some phrase[[:blank:]]/


This should be 245:"*some phrase*"...

4. If you have relied on "exact phrase matching", please switch to
    regular expression queries like:

       245:/^Exact title.$/

This should be 245:"Exact title".

Sorry for asking here.

If I understand this correctly, every /exact/ search, i.e. what
was "bla" in the old world (no substring matching), would have to
be a regular expression in this new scheme, right?

This would IMHO /not/ be sensible at all.

First of all, if I place bla explicitly in quotes I /expect/
it to be an exact match and not a substring, so this is
counterintuitive. See G (and friends): the only way to switch
off their "intelligence" is to put things explicitly in
quotes.

Secondly, it would mean that all our ID searches which are
"ID:(src)Number"-type things end up in really /expensive/
regexp searches.

I.e. we would regexp something like this:

http://juser.fz-juelich.de/search?p=%28collection%3A%22VDB%22+and+web%3A%222013%22%29+and+%28id%3A%22WOS%3A000%2A%22+or+sid%3A%22StatID%3A%28DE-HGF%290100%22+or+sid%3A%22StatID%3A%28DE-HGF%290110%22+or+sid%3A%22StatID%3A%28DE-HGF%290111%22+or+sid%3A%22StatID%3A%28DE-HGF%290120%22+or+sid%3A%22StatID%3A%28DE-HGF%290130%22%29+and+pof%3A%22G%3A%28DE-HGF%29POF2-110%22
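
For readability, the URL-encoded query above can be decoded with the Python standard library. This is only a convenience sketch; the variable names are illustrative and not part of Invenio.

```python
from urllib.parse import urlsplit, parse_qs

# The search URL from above, split into pieces for readability only.
url = (
    "http://juser.fz-juelich.de/search?p="
    "%28collection%3A%22VDB%22+and+web%3A%222013%22%29"
    "+and+%28id%3A%22WOS%3A000%2A%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290100%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290110%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290111%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290120%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290130%22%29"
    "+and+pof%3A%22G%3A%28DE-HGF%29POF2-110%22"
)

# parse_qs undoes both the percent-escapes and the '+' for spaces.
decoded = parse_qs(urlsplit(url).query)["p"][0]
print(decoded)
```

Every sid:, id: and pof: term in the decoded query is a double-quoted ID lookup, which is the kind of term that would have to become a regular expression under the proposed scheme.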

Please holler if this change could badly break some of
your workflows.

If I get it correctly, it breaks almost all our bean
counting. IDs are something like

  sid:(DE-HGF)1

or

  sid:(DE-HGF)11

If you map "sid:(DE-HGF)1" to the old 'sid:(DE-HGF)1', it
also matches "sid:(DE-HGF)11", which is wrong and not
intended.
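
The difference can be shown with a minimal Python sketch. It models only the matching semantics on plain strings and is not Invenio's search code. The regular-expression part uses Python's `re` module as a stand-in, so the exact regexp dialect is an assumption.

```python
import re

ids = ["(DE-HGF)1", "(DE-HGF)11"]
wanted = "(DE-HGF)1"

# Substring semantics (old single quotes): both IDs are hits.
substring_hits = [i for i in ids if wanted in i]

# Exact semantics (old double quotes): only the intended ID is a hit.
exact_hits = [i for i in ids if i == wanted]

# Regexp emulation of an exact match: the parentheses in the ID are
# regexp metacharacters and must be escaped first.
escaped_hits = [i for i in ids if re.fullmatch(re.escape(wanted), i)]

# Without escaping, "(DE-HGF)" is read as a group, so nothing matches.
unescaped_hits = [i for i in ids if re.fullmatch(wanted, i)]

print(substring_hits, exact_hits, escaped_hits, unescaped_hits)
```

Besides the cost of a regexp search, the unescaped case shows that IDs containing parentheses would need careful quoting in every such query.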

So, if one wants to unify quotes (I agree that the
distinction between '' and "" is difficult to explain, and
it definitely needs explaining), then one should unify them so
that quotes always mean exact matches. One could then have
substring search by either leaving out the quotes or putting in
explicit * operators like

  "*bla*"

i.e. let the '' behave like the "" in Invenio logic, but do not
have "" trigger substring matches.

This is also something I can explain to the Normal User(tm),
while I think your regexps above are a bit beyond their
common language ;)
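
The proposed rule (quotes always mean an exact match, explicit * gives substring or prefix matching) can be sketched in a few lines of Python. The function `phrase_matches` is hypothetical and only models the intended semantics on plain strings; it says nothing about how Invenio would implement it on its indexes.

```python
import re


def phrase_matches(pattern: str, value: str) -> bool:
    """Hypothetical matcher: exact match, with '*' as the only wildcard."""
    # Escape everything literally, then let each '*' stand for any text.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.fullmatch(".*".join(parts), value) is not None


print(phrase_matches("(DE-HGF)1", "(DE-HGF)1"))    # exact match
print(phrase_matches("(DE-HGF)1", "(DE-HGF)11"))   # no accidental substring hit
print(phrase_matches("*bla*", "some bla here"))    # explicit substring search
print(phrase_matches("WOS:000*", "WOS:000123"))    # explicit prefix search
```

With this rule the user never has to write a regular expression; the only special character to explain is the asterisk.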

--

Kind regards,

Alexander Wagner
Scientific Services / Scientific Publishing
Central Library
52425 Juelich

mail : [email protected]
phone: +49 2461 61-1586
Fax  : +49 2461 61-6103
http://www.fz-juelich.de/zb/wp


