On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier <[email protected]> wrote:
> Hi all,
>
> I mentioned this during an e-mail discussion on the list last month, but I
> just wanted to hear from others in the Evergreen community about whether
> there is a desire to improve the relevance ranking for search results in
> Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
> look at things like the document length, word proximity, and unique word
> count. We've found that we had to remove the modifiers for document length
> and unique word count to prevent a problem where brief bib records were
> ranked way too high in our search results.

FWIW, there is a library testing some new combinations of CD modifiers
and having some success. As soon as I know more I will share (if they
don't first).

>
> In our local discussions, we've thought the following enhancements could
> improve the ranking of search results:
>
> * Giving greater weight to a record if the search terms appear in the title
> or subject (ideally, we would like these fields to be configurable.) This is
> something that is tweakable in search.relevance_ranking, but my
> understanding is that the use of these tweaks results in a major reduction
> in search performance.
>

Indeed they do; however, rewriting them in C to be super-fast would
improve the situation. It's primarily a matter of available time
and effort. It's also, however, pretty specialized work, as you're
dealing with Postgres at a very intimate level.

> * Using some type of popularity metric to boost relevancy for popular
> titles. I'm not sure what this metric should be (number of copies attached
> to record? Total circs in last x months? Total current circs?), but we
> believe some type of popularity measure would be particularly helpful in a
> public library where searches will often be for titles that are popular. For
> example, a search for "twilight" will most likely be for the Stephenie
> Meyer novel and not this
> http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
> Rylander had indicated in a previous e-mail
> (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
> handle this through an overnight cron job without a negative impact on
> search speeds.

Right ... A regular stats-gathering job could certainly allow this,
and (if the "QueryParser explain" branch gets merged to master so we
have a standard search canonicalization function) logged query
analysis is another option as well.

>
> Do others think these two enhancements would improve the search results in
> Evergreen? Do you think there are other things we could do to improve
> relevancy? My main concern would be that any changes might slow down search
> speeds, and I would want to make sure that we could do something to retrieve
> better search results without a slowdown.
>

I would prefer better results with a speed /increase/! :) But who
wouldn't? I can offer at least one lower-hanging fruit idea: switch
from GIST indexes to GIN indexes by default, as they're much faster
these days.

> Also, I was wondering if this type of project might be a good candidate for
> a Google Summer of Code project.
>

The fairly mechanical change from GIST to GIN indexing is definitely a
small-effort thing. I think the other ideas listed here (and still
others from the past, like direct MARC indexing, and use of tsearch
weighting classes) are probably worth trying -- particularly the
relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
turn out to be too big. It's worth listing them as ideas for
candidates to propose, though.

--
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [email protected]
 | web: http://www.esilibrary.com
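
For anyone who wants something concrete to poke at, here is a rough sketch of the popularity idea discussed above: a nightly job writes out per-record circulation counts, and at search time the text relevance score is multiplied by a log-damped boost. This is only an illustration under stated assumptions, not how Evergreen does or would do it. The names `Hit`, `popularity_boost`, `rerank`, the `weight` parameter, and the choice of circulation count as the metric are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    """One search result (hypothetical shape, not an Evergreen structure)."""
    record_id: int
    text_rank: float  # relevance score from the full-text search


def popularity_boost(circ_count: int, weight: float = 0.5) -> float:
    """Return a multiplier >= 1.0 that grows slowly with circulation.

    The log damping keeps a hugely popular title from swamping
    text relevance entirely; `weight` is an assumed tuning knob.
    """
    return 1.0 + weight * math.log1p(max(circ_count, 0))


def rerank(hits, circ_stats, weight: float = 0.5):
    """Sort hits by text rank times the popularity boost, best first.

    `circ_stats` maps record_id -> circulation count and stands in for
    whatever table a nightly stats-gathering job would populate.
    Records with no stats get a neutral boost of 1.0.
    """
    def score(hit):
        circs = circ_stats.get(hit.record_id, 0)
        return hit.text_rank * popularity_boost(circs, weight)

    return sorted(hits, key=score, reverse=True)


if __name__ == "__main__":
    hits = [Hit(record_id=1, text_rank=0.60), Hit(record_id=2, text_rank=0.55)]
    nightly_stats = {2: 500}  # record 2 circulated heavily
    for hit in rerank(hits, nightly_stats):
        print(hit.record_id, hit.text_rank)
```

Because the counts are precomputed, the search-time cost is one lookup and one multiplication per hit, which is the reason an overnight job should not hurt search speed.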
