On Mon, May 7, 2012 at 3:12 PM, Kathy Lussier <[email protected]> wrote:
> Hi Mike,
>
>
>> FWIW, there is a library testing some new combinations of CD modifiers
>> and having some success. As soon as I know more I will share (if they
>> don't first).
>
> Did anything ever come of this? I would be interested in seeing any examples
> that resulted in improved relevancy.
>

The testing occurred, but I haven't heard the outcome yet. I'll dig
for it ASAP.

>
>> The fairly mechanical change from GIST to GIN indexing is definitely a
>> small-effort thing. I think the other ideas listed here (and still
>> others from the past, like direct MARC indexing, and use of tsearch
>> weighting classes) are probably worth trying -- particularly the
>> relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
>> turn out to be too big. It's worth listing them as ideas for
>> candidates to propose, though.
>
> I was happy to see that "Optimize Evergreen: Convert PL/Perl-based
> PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted GSoC
> projects. However, since I got a little lost in the technical details of
> this discussion, I was curious if, when this GSoC project is complete, we
> can feel more comfortable about using search.relevance_ranking to tweak
> the relevancy without adversely affecting search performance.
>

Short version: yes

Longer version: that's exactly one of the goals, and there are some
other avenues of attack as well that should speed search and are
related to (but not strictly inside) the GSoC project.

--miker

> I know there were two related GSoC ideas listed, and I wasn't sure if both
> needed to be done together to ultimately improve search speeds.
>
> Thanks!
>
> Kathy
>
>
> --
>
> Kathy Lussier
> Project Coordinator
> Massachusetts Library Network Cooperative
> (508) 756-0172
> (508) 755-3721 (fax)
> [email protected]
> Twitter: http://www.twitter.com/kmlussier
>
>
> On 3/6/2012 5:00 PM, Mike Rylander wrote:
>>
>> On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier<[email protected]>
>> wrote:
>>>
>>> Hi all,
>>>
>>> I mentioned this during an e-mail discussion on the list last month, but I
>>> just wanted to hear from others in the Evergreen community about whether
>>> there is a desire to improve the relevance ranking for search results in
>>> Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
>>> look at things like the document length, word proximity, and unique word
>>> count. We've found that we had to remove the modifiers for document length
>>> and unique word count to prevent a problem where brief bib records were
>>> ranked way too high in our search results.
>>
>>
>> FWIW, there is a library testing some new combinations of CD modifiers
>> and having some success. As soon as I know more I will share (if they
>> don't first).
>>
>>>
>>> In our local discussions, we've thought the following enhancements could
>>> improve the ranking of search results:
>>>
>>> * Giving greater weight to a record if the search terms appear in the title
>>> or subject (ideally, we would like these fields to be configurable.) This is
>>> something that is tweakable in search.relevance_ranking, but my
>>> understanding is that the use of these tweaks results in a major reduction
>>> in search performance.
>>>
>>
>> Indeed they do; however, rewriting them in C to be super-fast would
>> improve this situation. It's primarily a matter of available time and
>> effort. It's also, however, pretty specialized work as you're dealing
>> with Postgres at a very intimate level.
>>
>>> * Using some type of popularity metric to boost relevancy for popular
>>> titles. I'm not sure what this metric should be (number of copies attached
>>> to record? Total circs in last x months? Total current circs?), but we
>>> believe some type of popularity measure would be particularly helpful in a
>>> public library where searches will often be for titles that are popular. For
>>> example, a search for "twilight" will most likely be for the Stephenie
>>> Meyer novel and not this
>>> http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
>>> Rylander had indicated in a previous e-mail
>>> (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
>>> handle this through an overnight cron job without a negative impact on
>>> search speeds.
>>
>>
>> Right ... A regular stats-gathering job could certainly allow this,
>> and (if the "QueryParser explain" branch gets merged to master so we
>> have a standard search canonicalization function) logged query
>> analysis is another option as well.
>>
>>>
>>> Do others think these two enhancements would improve the search results in
>>> Evergreen? Do you think there are other things we could do to improve
>>> relevancy? My main concern would be that any changes might slow down search
>>> speeds, and I would want to make sure that we could do something to retrieve
>>> better search results without a slowdown.
>>>
>>
>> I would prefer better results with a speed /increase/! :) But who
>> wouldn't?
>>
>> I can offer at least one lower-hanging fruit idea: switch from GIST
>> indexes to GIN indexes by default, as they're much faster these days.
>>
>>> Also, I was wondering if this type of project might be a good candidate for
>>> a Google Summer of Code project.
>>>
>>
>> The fairly mechanical change from GIST to GIN indexing is definitely a
>> small-effort thing. I think the other ideas listed here (and still
>> others from the past, like direct MARC indexing, and use of tsearch
>> weighting classes) are probably worth trying -- particularly the
>> relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
>> turn out to be too big. It's worth listing them as ideas for
>> candidates to propose, though.
>>
>

--
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [email protected]
 | web: http://www.esilibrary.com
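
For anyone trying to picture the popularity idea discussed in the thread, here is a rough sketch of how a nightly stats job plus a cheap query-time adjustment might fit together. It is not Evergreen code: the function names, the log scaling, the `boost` parameter, and the use of circulation counts as the metric are all assumptions made for illustration. The point is only that the expensive work (scoring every record) can happen overnight, leaving a dictionary lookup and one multiplication per hit at search time.

```python
import math
from typing import Dict, List, Tuple


def build_popularity_table(circ_counts: Dict[int, int]) -> Dict[int, float]:
    """Nightly step (hypothetical): map record id -> popularity in [0, 1].

    Assumes non-negative circulation counts. Log scaling keeps a few very
    popular titles from dwarfing everything else.
    """
    if not circ_counts:
        return {}
    max_log = max(math.log1p(count) for count in circ_counts.values())
    if max_log == 0:
        # Nothing has circulated, so every record gets a neutral score.
        return {rec_id: 0.0 for rec_id in circ_counts}
    return {
        rec_id: math.log1p(count) / max_log
        for rec_id, count in circ_counts.items()
    }


def rerank(
    hits: List[Tuple[int, float]],
    popularity: Dict[int, float],
    boost: float = 0.5,
) -> List[Tuple[int, float]]:
    """Query-time step (hypothetical): scale each text-relevance score.

    Each score is multiplied by (1 + boost * popularity); records missing
    from the table are treated as popularity 0 and keep their score.
    """
    adjusted = [
        (rec_id, score * (1.0 + boost * popularity.get(rec_id, 0.0)))
        for rec_id, score in hits
    ]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)


# Two records matching "twilight": 101 circulates heavily, 202 barely at all.
popularity = build_popularity_table({101: 950, 202: 3})
ranked = rerank([(202, 0.80), (101, 0.78)], popularity)
```

In this made-up example record 202 starts with the slightly higher text score, but record 101 ends up first once its circulation count is factored in. A multiplicative boost is just one choice; an additive bonus or a cap on the boost would be equally easy to try.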
