Hi Alexander!


On 01/29/2013 12:53 PM, Alexander Wagner wrote:
I have proposed
a patch for this, for maint-1.1: a message will be displayed saying that
the boolean operation resulted in no hits, and we will print each of the
individual queries with their number of results (so the user can choose
to go only for a part of his query) - similar with the response obtained
from simple search. The patch will be integrated shortly.

This sounds reasonable.

However, please keep one point in mind: "did you mean ... I'll display
those results even if you didn't" like query handling is great for
"discovery like" searches, i.e. "I don't know exactly what I'm looking
for" (usually, this is triggered by some carbon based life form on
OSI-layer 8) so displaying some area that is more likely and produces
hits is helpful.

However, in many scenarios there is also a need for exact matches that
do NOT display something else if the result is zero.

To give an example: during migration we matched our old datasets against
a number of external sources like pubmed, arXiv, inspire and so on.
Usually, we used the DOI as key for this input. However, for some DOIs
pubmed finds it helpful to return exactly one record that is entirely
unrelated due to a "did you mean" expansion. (As long as I got two or
more hits I disregarded the match.) This gave me a bunch of wrong
associations though I had a precise input parameter. As precise as a DOI
can be.

Note also that if you strip off / . <whatever> from a DOI-like entity you
might produce a dupe that didn't exist with those in place. I stumbled
upon this because we implemented a basic dupe detection on websubmit that
just searches the local database by the doi/pmid/arXiv whatever we got
as input. Usually, I used the string as such and did an "all fields" but
this triggered wrong results due to the stripping. (I re-coded this part
of the code to search in fields 0247_$a and 773__$a only and to put the
value in "". Now it seems fine.)
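
A minimal sketch of such a field-restricted, quoted lookup, assuming the usual field:"value" query syntax. The helper name exact_identifier_query is hypothetical, and so are the tag spelling without the "$" and the "or" between the two fields; only the two MARC fields and the quoting come from the description above.

```python
def exact_identifier_query(identifier, fields=("0247_a", "773__a")):
    """Build a query that looks for `identifier` only in the given fields.

    The value is wrapped in double quotes so that characters such as
    '/' and '.' are kept as part of the identifier and not stripped.
    Field names and the 'or' operator are assumptions for illustration.
    """
    if '"' in identifier:
        # Quote escaping rules are not known here, so refuse such input
        # rather than guess.
        raise ValueError("identifier must not contain a double quote")
    return " or ".join('%s:"%s"' % (field, identifier) for field in fields)


# Example with a made-up DOI:
print(exact_identifier_query("10.1000/xyz.123"))
```

The point of the design is that the raw identifier is never sent to an "all fields" search, so a stripped variant of it cannot match an unrelated record.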

Similar use case: JuSER feeds our web pages. Usually, these are searches
like: (cid:"inst-ID" and typ:"doctypeID" and pub:"year") and require
"return exactly". However, at the beginning of the year most of these
searches will yield empty results for some time and if the search
algorithm throws in a "did you mean" and returns, say, the results for
institute A instead of those for institute B just "cause there is something"
(did you mean?) in A but not in B this will cause trouble.

Indeed, problems might arise due to nearest search terms, but the suggestions are mostly for web-interface users, to guide them to a better search. Any other outputs, except the html-based ones, will return an empty list. Actually, there are some ways you can instruct the search engine to do only exact search, and no nearest searches. One way is to use the 'ap' parameter (alternative patterns); you can test its behaviour by adding &ap=0 or &ap=1:

ap - alternative patterns (0=no, 1=yes).  In case no exact
                     match is found, the search engine can try alternative
                     patterns e.g. to replace non-alphanumeric characters by
                     a boolean query.  ap defines if this is wanted.
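
For scripted access, the parameter can simply be added to the query string. A small sketch, assuming only the 'p' and 'ap' URL parameters mentioned in this thread; the helper name build_search_url is hypothetical and not part of Invenio.

```python
from urllib.parse import urlencode


def build_search_url(base_url, pattern, alternative_patterns=False):
    """Return a search URL with the 'ap' switch set explicitly.

    ap=0 asks for exact matching only, ap=1 allows alternative patterns.
    `base_url` is the /search endpoint of the instance being queried.
    """
    params = {"p": pattern, "ap": 1 if alternative_patterns else 0}
    return base_url + "?" + urlencode(params)


# Exact search only, no alternative patterns:
print(build_search_url("http://cds.cern.ch/search",
                       "author:müller and author:wert"))
```

Setting ap explicitly in every generated URL avoids depending on whatever the default of the installation happens to be.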

Another option, but this concerns only the display, is to customise CFG_WEBSEARCH_DISPLAY_NEAREST_TERMS to control whether any suggestions will be displayed or not.

In any case, the patch that I proposed for the advanced search issue neither proposes nearest searches ('did you mean') nor does any query manipulation in order to get some results. It behaves exactly as in this case:
<http://cds.cern.ch/search?p=author%3Am%C3%BCller+and+author%3Awert>
(so it just points out that there are individual results, but the boolean operation returned none: in this case the user will know for sure that his search patterns are correct, it's just that there are no papers written by both müller and wert).
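
To illustrate the idea only (this is a toy sketch over an in-memory index, not the code of the patch; the function name and the index layout are made up for the example):

```python
def and_search_with_breakdown(index, terms):
    """AND all terms; if nothing matches, report hits per individual term.

    `index` maps a query term to the set of record ids matching it.
    Returns (hits, breakdown); breakdown is None when the AND has hits.
    The queries themselves are never modified or expanded.
    """
    per_term = {term: index.get(term, set()) for term in terms}
    hits = set.intersection(*per_term.values()) if per_term else set()
    if hits:
        return hits, None
    return hits, {term: len(ids) for term, ids in per_term.items()}


# Toy data: each author has papers, but none in common.
toy_index = {"author:müller": {1, 2}, "author:wert": {3}}
hits, breakdown = and_search_with_breakdown(
    toy_index, ["author:müller", "author:wert"])
print(hits, breakdown)
```

The empty result stays empty; the per-term counts are extra information next to it, which is why exact-match use cases are not affected.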


Cheers,
Ludmila



Regarding the advanced search being treated as simple search: this is a
very good question :-)
[...]
complicated query in the simple search - in this way they will also
'see' how the query is formed (what we use for regex search or exact
phrase search, etc.).

This will be great. If it is available soon I'll happily wait for it. :)
If it's a major task I'd suggest rewriting the internals of adv search
(perform_request_search I think) to just rebase it to a simple search
with the same logic. I think this also solves Ferran's observations.

I am currently re-basing this branch and preparing it for integration to
Invenio (and also deployment to CDS, so you can see it 'in action' in a
few days.)

:) Seems to be available shortly, so we might get it if we move up
to 1.x (with x > 0). This upgrade will likely happen as soon as the
current evaluation period is done GSI, DESY and probably RWTH are up and
running as well. I dare not touch JuSER during this evaluation stuff.
This has to do with the simple fact that we got ~1100 websubmits to JuSER
in 2012 and another ~1200 in 2013 till now. @Samuele: that is the main
reason why I do not just apply a 1.1 update to fix the OAI server. If I
break something now I can set up my tent here on campus. Currently,
amongst others, a bit cold for this in Jülich ;)


--
Ludmila Marian ** CERN Document Server ** <http://cds.cern.ch/>
