Re: Search magic

Ludmila Marian Wed, 11 Apr 2012 07:36:48 -0700

Hi Alexander,


On 04/11/2012 03:09 PM, Alexander Wagner wrote:

On 11.04.2012 14:05, Ludmila Marian wrote:

Hello Ludmilla!

perform_request_search(p='plo', cc='People')

This will return all the records that contain the word 'plo' in any of
the fields.
So the query to the db would be like: select <something> from
<all_field_index> where value='plo';


This is what I would expect except probably the notion of "word". Ie. I
would expect that simple search uses substring. I understand that this
is not the case. Right?

Right. It is similar to Google (I think :-) ), and when no results are found, some nearest terms are presented.

perform_request_search(p1='plo', m1='r', f1='author', cc='People')


This indeed is more restrictive since it searches only the author index,
but it is broader because it is doing a REGEXP search.


Though I do not use any regexping here.

m1='r' means regular expression search, so the Invenio search engine automatically transforms it into a REGEXP search (even if you are not using any regexp syntax).

The query in this case would be: select <something> from <author_index>
where value REGEXP 'plo';


That's what I understood. From a feeling I would guess that this is the
same as searching for 'plo' in the first case...

and this will match also the words that contain 'plo' as a substring (so
'fooplobar' would be a match) - as when doing a substring/phrase search.


... as I did NOT search for .*plo.*

I understand it works like the match operator, right? Something like

       hit = 1 if str =~ m/plo/;

in perlspeak. So, in selecting regexp search I automagically win
"left/right truncation" to the search string which itself is handled as
phrase, right? Something like "a b c" in regexp search would search for
.*a b c.* (again something like =~ m/a b c/) and not for "a or b or c"
in simple search?

Exactly. This is the default behavior of the REGEXP operator in MySQL. About the phrase vs string search you are again right: all the m='r' searches are done on the phrase index (and not the word index). FYI, in connection with the Perl syntax: to instruct the search engine to perform a regular expression search from the simple search interface, write the query as /search_query/. More details and examples on this:

<http://invenio-demo.cern.ch/help/search-guide#regexp>
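
The unanchored matching described above can be sketched in Python instead of Perl or MySQL. This is only an illustration: it assumes that, for plain literals such as 'plo' or 'a b c', Python's re.search behaves like the REGEXP operator (a match anywhere in the value counts). The sample values and the helper name regexp_like are made up and are not part of Invenio.

```python
import re

# Made-up sample values standing in for entries of a phrase index.
values = ["plo", "fooplobar", "a b c d", "abc"]

def regexp_like(pattern, value):
    """Unanchored match, analogous to `value REGEXP pattern` or Perl's `=~ m/pattern/`."""
    return re.search(pattern, value) is not None

# 'plo' matches anywhere in the value, so 'fooplobar' is a hit as well.
plo_hits = [v for v in values if regexp_like("plo", v)]

# 'a b c' is matched as one phrase (spaces included), not as "a or b or c".
phrase_hits = [v for v in values if regexp_like("a b c", v)]

print(plo_hits)
print(phrase_hits)
```

No explicit .* is needed on either side of the pattern, which is the "left/right truncation" effect discussed above.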


Or the other way round: to mimic simple search via regexp in my first
example I would have had to search \bplo\b?

This might be a bit of overkill for the search engine, since a regexp search is quite heavy for the system. The way I see it (but this is my personal opinion), regexp search should be used in cases where one needs very complicated queries (e.g. search for all the records that have titles starting with Foo and ending with Bar, do not contain any numbers and have at most 3 words... or something that makes more sense :-) ). On the other hand, word search is very easy for the system, since we are doing a lot of pre-processing at indexing time. So searching for all records that contain the word plo would be either:

perform_request_search(p1="plo", m1="a") # advanced search interface (m1='a' means 'all of the words')

or

perform_request_search(p="plo") # simple search

For retrieving all the words that contain 'plo' as a substring, the possibilities are:

perform_request_search(p=" 'plo' ") # simple search (enclosing your search query in single quotes (' ') means substring match, while double quotes (" ") mean exact phrase search... but this is in general a bit confusing for people, so we will probably drop this behavior in the near future; there is already a branch waiting for integration)

or

perform_request_search(p="/plo/") # simple search that will be translated into regexp

or

perform_request_search(p1="plo", m1='p') # advanced search + partial phrase (substring match)

or

perform_request_search(p1="plo", m1='r') # advanced search + regexp
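
A toy sketch of the difference between the two groups of queries above. This is not Invenio code: the records, the word_index dictionary and the three helper functions are hypothetical, and real indexing does much more (stemming, for instance). It only shows why a word search can be a cheap lookup prepared at indexing time, while substring and regexp searches have to scan the phrases, and why they can return more hits than a word search.

```python
import re
from collections import defaultdict

# Hypothetical records: record id -> author phrase (all lowercase for simplicity).
records = {1: "plo", 2: "fooplobar", 3: "plo koon", 4: "smith"}

# "Indexing time": split every phrase into words once and remember the record ids.
word_index = defaultdict(set)
for recid, phrase in records.items():
    for word in re.findall(r"\w+", phrase):
        word_index[word].add(recid)

def word_search(term):
    """Word match: one dictionary lookup, no scanning of the phrases."""
    return sorted(word_index.get(term, set()))

def substring_search(term):
    """Substring match: every phrase has to be inspected."""
    return sorted(r for r, p in records.items() if term in p)

def regexp_search(pattern):
    """Unanchored regexp match over every phrase."""
    return sorted(r for r, p in records.items() if re.search(pattern, p))

print(word_search("plo"))         # whole-word hits only
print(substring_search("plo"))    # also finds 'fooplobar'
print(regexp_search("plo"))       # same hits as the substring search here
print(regexp_search(r"\bplo\b"))  # word boundaries mimic the word search
```

In this toy data the last query returns the same records as word_search("plo"), but it gets there by scanning every phrase with a regular expression, which is the overkill mentioned above.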

I would assume that the simple search gives at least as
many results as the more complex and in fact restricted
(I'm searching only in index 'authors') query. However, the
first one yields 0 results, while the second one gives me 8
hits.


I think if you did the same type of search (m='a' or m='r') in both
cases, you would see the behavior that you expect (more results
when doing simple search). Otherwise m='r' will probably yield more
results than m='a' in most cases, even if you are searching in a
smaller space.


I wonder if this is intuitive from an end user's perspective. Going to
simple search in the first place is usually someone with the notion "oh,
it's like google, I like that". So wouldn't she expect all these
automagic truncations to happen? In a way this was what I fell for in my
simplistic approach. I always used regexp in all other parts but for
whatever reason used simple in this single application with the notion:
oh, I don't need real regexp, substring in all fields is just fine,
probably a bit too broad but given that collection it does no harm.

My impression is that Google by default does word search and not substring search. I see on their advanced search page that they have 'all these words' as the default option, but I can't find anything clear on this. I think having magic truncation by default would introduce a lot of noise. We have a bit of right truncation (because we are using stemming), so when searching for something you will also find the plural & other forms, but for more than that, the user needs to specify it in the query.

Some of the above might be better explained in the search guide <http://invenio-demo.cern.ch/help/search-guide>, although, as I was saying, we might change some things in the future regarding single/double quote searches.



Best regards,
Ludmila

--
Ludmila Marian ** CERN Document Server **<http://cds.cern.ch/>
