Hi Alexander,
On 04/11/2012 03:09 PM, Alexander Wagner wrote:
On 11.04.2012 14:05, Ludmila Marian wrote:
Hello Ludmilla!
perform_request_search(p='plo', cc='People')
This will return all the records that contain the word 'plo' in any of
the fields.
So the query to the db would be like: select <something> from
<all_field_index> where value='plo';
This is what I would expect, except perhaps for the notion of "word",
i.e. I would expect simple search to use substring matching. I
understand that this is not the case. Right?
Right. It is similar to Google (I think :-) ), and when no results are
found, some nearest terms are presented.
perform_request_search(p1='plo', m1='r', f1='author', cc='People')
This is indeed more restrictive, since it searches only the author
index, but it is also broader because it is doing a REGEXP search.
Though I do not use any regexping here.
m1='r' means regular expression search, so the Invenio search engine
automatically transforms it into a REGEXP search (even if you are not
using any regexp syntax).
The query in this case would be: select <something> from <author_index>
where value REGEXP 'plo';
That's what I understood. From a feeling I would guess that this is the
same as searching for 'plo' in the first case...
and this will match also the words that contain 'plo' as a substring (so
'fooplobar' would be a match) - as when doing a substring/phrase search.
... as I did NOT search for .*plo.*
I understand it works like the match operator, right? Something like
$hit = 1 if $str =~ m/plo/;
in perlspeak. So, by selecting regexp search I automagically gain
"left/right truncation" on the search string, which itself is handled
as a phrase, right? Something like "a b c" in regexp search would search
for .*a b c.* (again something like =~ m/a b c/) and not for "a or b or
c" as in simple search?
Exactly. This is the default behavior of the REGEXP operator in MySQL.
About the phrase vs string search you are again right. All the m='r'
searches are done on the phrase index (and not the word index). (FYI, in
connection with the Perl syntax: the syntax for instructing the search
engine to perform a regular expression search when using the simple
search interface is /search_query/. More details and examples on this:
<http://invenio-demo.cern.ch/help/search-guide#regexp>)
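To make the unanchored matching concrete, here is a minimal sketch in
plain Python. It uses the standard re module as a stand-in for MySQL's
REGEXP (an assumption: the two regexp dialects differ in details), and
the sample values are made up; none of this is Invenio code.

```python
import re

# Made-up index values, for illustration only.
values = ["plo", "fooplobar", "a b c", "x a b c y", "a", "c b a"]

def regexp_hits(pattern, candidates):
    # re.search is unanchored, much like  value REGEXP 'pattern':
    # the pattern may match anywhere inside the value.
    return [v for v in candidates if re.search(pattern, v)]

print(regexp_hits("plo", values))    # ['plo', 'fooplobar']
print(regexp_hits("a b c", values))  # ['a b c', 'x a b c y']
```

Note that "a b c" is treated as one pattern, so a value containing only
"a", or the words in another order, is not a hit.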
Or the other way round: to mimic simple search via regexp in my first
example I would have had to search \bplo\b?
This might be a bit of overkill for the search engine, since doing a
regexp search is quite heavy for the system. The way I see it (but this
is my personal opinion), regexp search should be used in cases where one
needs very complicated queries (e.g. search for all the records that
have titles starting with Foo and ending with Bar, that do not contain
any numbers, and that have at most 3 words... or something that makes
more sense :-) ). On the other hand, word search is very easy for the
system (since we are doing a lot of pre-processing at indexing time). So
searching for all records that contain the word plo would be either:
perform_request_search(p1="plo", m1="a") #advanced search interface
(m1='a' means 'all of the words')
or
perform_request_search(p="plo") #simple search
For retrieving all the words that contain 'plo' as a substring, the
possibilities are:
perform_request_search(p=" 'plo' ") #simple search (enclosing your
search query in single quotes (' ') means substring match, while double
quotes (" ") mean exact phrase search; this is in general a bit
confusing for people, so we will probably drop this behavior in the near
future - there is already a branch awaiting integration for this)
or
perform_request_search(p="/plo/") #simple search that will be translated
into regex
or
perform_request_search(p1="plo", m1='p') # advanced search + Partial
Phrase (substring match)
or
perform_request_search(p1="plo", m1='r') # advanced search + regexp
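The difference between the matching types can be sketched with a toy
function. This is only an illustration under simplifying assumptions:
toy_match is a hypothetical helper, not part of Invenio, it splits words
on whitespace only, and it ignores stemming and the real index tables.

```python
import re

def toy_match(value, query, mode):
    """Toy stand-in for the matching types discussed above."""
    if mode == "a":
        # 'all of the words': each query word must equal a whole word.
        words = set(value.split())
        return all(w in words for w in query.split())
    if mode == "p":
        # 'partial phrase': plain substring match.
        return query in value
    if mode == "r":
        # 'regular expression': unanchored regexp match.
        return re.search(query, value) is not None
    raise ValueError("unknown mode: %r" % mode)

def toy_search(values, query, mode):
    return [v for v in values if toy_match(v, query, mode)]

values = ["plo", "fooplobar", "plot"]  # made-up sample values
for mode in ("a", "p", "r"):
    print(mode, toy_search(values, "plo", mode))
```

With these sample values, mode 'a' finds only "plo", while 'p' and 'r'
also find "fooplobar" and "plot".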
I would assume that the simple search gives at least as
many results as the more complex and in fact restricted
(I'm searching only in index 'authors') query. However, the
first one yields 0 results, while the second one gives me 8
hits.
I think if you did the same type of search (m='a' or m='r') in both
cases, you would see the behavior that you expect (more results when
doing simple search); otherwise m='r' will probably yield more results
than m='a' in most cases, even if you are searching on a smaller space.
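A small self-contained sketch of this effect, with invented records and
field names (nothing here comes from a real collection). Case-insensitive
matching is assumed for the regexp, and the two helper functions are
hypothetical stand-ins, not Invenio functions.

```python
import re

# Hypothetical records; authors and titles are invented for this sketch.
records = [
    {"author": "Plotkin, G", "title": "Notes on search"},
    {"author": "Naplo, A", "title": "Indexing"},
    {"author": "Smith, J", "title": "Regexp basics"},
]

def word_search_all_fields(word):
    # Whole-word match across every field (like the simple search).
    hits = []
    for rec in records:
        text = " ".join([rec["author"], rec["title"]]).lower()
        if word.lower() in re.findall(r"\w+", text):
            hits.append(rec["author"])
    return hits

def regexp_search_author(pattern):
    # Unanchored regexp match on the author field only
    # (case-insensitive here, which is an assumption of this sketch).
    return [rec["author"] for rec in records
            if re.search(pattern, rec["author"], re.IGNORECASE)]

print(word_search_all_fields("plo"))  # []
print(regexp_search_author("plo"))    # ['Plotkin, G', 'Naplo, A']
```

The word search covers more fields but finds nothing, because no field
contains 'plo' as a whole word; the regexp search covers one field and
still finds two records.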
I wonder if this is intuitive from an end user's perspective. Someone
who goes to simple search in the first place usually has the notion "oh,
it's like google, I like that". So wouldn't she expect all these
automagic truncations to happen? In a way this was what I fell for in my
simplistic approach. I always used regexp in all other parts but for
whatever reason used simple in this single application with the notion:
oh, I don't need real regexp, substring in all fields is just fine,
probably a bit too broad but given that collection it does no harm.
My impression is that Google by default does word search and not
substring search. I see on their advanced search page that they have
'all these words' as the default option, but I can't find anything
clear on this.
I think having magic truncation by default would introduce a lot of
noise. We have a bit of right truncation (because we are using
stemming), so when searching for something you will also find the plural
& other forms, but to get more than that, the user needs to specify it
in the query.
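As a rough illustration of what stemming gives you and what it does not,
here is a deliberately crude sketch. toy_stem is a hypothetical suffix
stripper written for this example only; it is not the stemmer Invenio
uses, and real stemmers are far more careful.

```python
def toy_stem(word):
    # Crude suffix stripping, for illustration only.
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_search(query, indexed_words):
    # A word is a hit when its stem equals the stem of the query.
    target = toy_stem(query)
    return [w for w in indexed_words if toy_stem(w) == target]

indexed = ["detectors", "detector", "plot", "plots", "fooplobar"]
print(stem_search("detector", indexed))  # ['detectors', 'detector']
print(stem_search("plot", indexed))      # ['plot', 'plots']
print(stem_search("plo", indexed))       # []
```

The plural forms are found, but 'plo' still does not match 'plot' or
'fooplobar': stemming is not substring search.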
Some of the above might be better explained in the search guide
<http://invenio-demo.cern.ch/help/search-guide>, although, as I was
saying, we might change some things in the future regarding
single/double quote searches.
Best regards,
Ludmila
--
Ludmila Marian ** CERN Document Server **<http://cds.cern.ch/>