Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Kathy Lussier Mon, 07 May 2012 12:11:44 -0700

Hi Mike,

> FWIW, there is a library testing some new combinations of CD modifiers
> and having some success.  As soon as I know more I will share (if they
> don't first).

Did anything ever come of this? I would be interested in seeing any examples that resulted in improved relevancy.


> The fairly mechanical change from GIST to GIN indexing is definitely a
> small-effort thing. I think the other ideas listed here (and still
> others from the past, like direct MARC indexing, and use of tsearch
> weighting classes) are probably worth trying -- particularly the
> relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
> turn out to be too big.  It's worth listing them as ideas for
> candidates to propose, though.

I was happy to see that "Optimize Evergreen: Convert PL/Perl-based PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted GSoC projects. However, since I got a little lost in the technical details of this discussion, I was curious whether, once this GSoC project is complete, we can feel more comfortable about using search.relevance_ranking to tweak the relevancy without adversely affecting search performance.

I know there were two related GSoC ideas listed, and I wasn't sure if both needed to be done together to ultimately improve search speeds.


Thanks!

Kathy

--

Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
kluss...@masslnc.org
Twitter: http://www.twitter.com/kmlussier

On 3/6/2012 5:00 PM, Mike Rylander wrote:

On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier<kluss...@masslnc.org>  wrote:

Hi all,

I mentioned this during an e-mail discussion on the list last month, but I
just wanted to hear from others in the Evergreen community about whether
there is a desire to improve the relevance ranking for search results in
Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
look at things like the document length, word proximity, and unique word
count. We've found that we had to remove the modifiers for document length
and unique word count to prevent a problem where brief bib records were
ranked way too high in our search results.


FWIW, there is a library testing some new combinations of CD modifiers
and having some success.  As soon as I know more I will share (if they
don't first).


In our local discussions, we've thought the following enhancements could
improve the ranking of search results:

* Giving greater weight to a record if the search terms appear in the title
or subject (ideally, we would like these fields to be configurable.) This is
something that is tweakable in search.relevance_ranking, but my
understanding is that the use of these tweaks results in a major reduction
in search performance.


Indeed they do, however rewriting them in C to be super-fast would
improve this situation.  It's primarily a matter of available time and
effort.  It's also, however, pretty specialized work as you're dealing
with Postgres at a very intimate level.
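
To make the field-weighting idea above concrete, here is a minimal sketch in Python. It is not how search.relevance_ranking or Evergreen's stored procedures work; the field names, the weight values, and the weighted_score helper are all assumptions made up for illustration.

```python
# Hypothetical per-field weights; field names and values are assumptions,
# not Evergreen configuration.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "author": 1.5, "keyword": 1.0}


def weighted_score(record, terms, weights=FIELD_WEIGHTS):
    """Sum, over all fields, weight * number of search terms found in the field."""
    score = 0.0
    for field, text in record.items():
        words = set(text.lower().split())
        hits = sum(1 for term in terms if term.lower() in words)
        # Fields without a configured weight count once.
        score += weights.get(field, 1.0) * hits
    return score


records = [
    {"title": "Evening walks", "subject": "Twilight Nature"},
    {"title": "Twilight", "subject": "Vampires Fiction"},
]
# Records matching in the title sort ahead of records matching only in the subject.
ranked = sorted(records, key=lambda r: weighted_score(r, ["twilight"]), reverse=True)
```

Keeping the weights in a plain mapping is one way to get the "configurable fields" behaviour asked for above; the performance concern in this thread is about doing the equivalent work inside the database, which this sketch does not address.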

* Using some type of popularity metric to boost relevancy for popular
titles. I'm not sure what this metric should be (number of copies attached
to record? Total circs in last x months? Total current circs?), but we
believe some type of popularity measure would be particularly helpful in a
public library where searches will often be for titles that are popular. For
example, a search for "twilight" will most likely be for the Stephenie
Meyer novel and not this
http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
Rylander had indicated in a previous e-mail
(http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
handle this through an overnight cron job without a negative impact on
search speeds.


Right ... A regular stats-gathering job could certainly allow this,
and (if the "QueryParser explain" branch gets merged to master so we
have a standard search canonicalization function) logged query
analysis is another option as well.
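
One possible shape for the popularity boost discussed above, as a rough Python sketch. The circ_count input, the boost factor, and the boosted_score name are assumptions; the idea is only that a nightly stats job could precompute a per-record count, so that search time pays for one multiplication.

```python
import math


def boosted_score(text_score, circ_count, boost=0.1):
    """Scale a text-match score by a log-damped popularity factor.

    circ_count is assumed to be precomputed by a nightly job (for example,
    total circulations in the last few months). The logarithm keeps very
    popular titles from swamping the text match entirely.
    """
    return text_score * (1.0 + boost * math.log1p(circ_count))


# Two records with the same text score: the heavily circulated one ranks first.
popular = boosted_score(1.0, 5000)
obscure = boosted_score(1.0, 3)
```

A record with no circulations keeps its original score, so the boost only ever moves popular titles up and never pushes unpopular ones below their text-match rank relative to each other.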


Do others think these two enhancements would improve the search results in
Evergreen? Do you think there are other things we could do to improve
relevancy? My main concern would be that any changes might slow down search
speeds, and I would want to make sure that we could do something to retrieve
better search results without a slowdown.


I would prefer better results with a speed /increase/! :)  But who wouldn't?

I can offer at least one lower-hanging fruit idea: switch from GIST
indexes to GIN indexes by default, as they're much faster these days.
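
As a sketch of how mechanical the GIST-to-GIN switch is, the Python helper below only builds the SQL statements; it does not run them. The table name, the index_vector column default, and the old index name in the usage line are assumptions for illustration and should be checked against the actual schema.

```python
def gin_migration_sql(table, column="index_vector", old_index=None):
    """Return SQL statements that add a GIN index and then drop an old index.

    table is a schema-qualified table name, column is assumed to be a
    tsvector column, and old_index (optional) is the schema-qualified
    name of the existing GIST index.
    """
    new_index = f"{table.replace('.', '_')}_{column}_gin_idx"
    statements = [f"CREATE INDEX {new_index} ON {table} USING GIN ({column});"]
    if old_index:
        # Drop the old index only after the new one has been created.
        statements.append(f"DROP INDEX IF EXISTS {old_index};")
    return statements


# Hypothetical names, for illustration only.
for stmt in gin_migration_sql("metabib.title_field_entry",
                              old_index="metabib.title_gist_idx"):
    print(stmt)
```

Creating the new index before dropping the old one means searches always have some index available during the change, at the cost of temporarily holding both on disk.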

Also, I was wondering if this type of project might be a good candidate for
a Google Summer of Code project.


The fairly mechanical change from GIST to GIN indexing is definitely a
small-effort thing. I think the other ideas listed here (and still
others from the past, like direct MARC indexing, and use of tsearch
weighting classes) are probably worth trying -- particularly the
relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
turn out to be too big.  It's worth listing them as ideas for
candidates to propose, though.
