On Tue, Mar 6, 2012 at 6:13 PM, James Fournie <[email protected]> wrote:
>>>
>>> * Giving greater weight to a record if the search terms appear in the title
>>> or subject (ideally, we would like these field to be configurable.) This is
>>> something that is tweakable in search.relevance_ranking, but my
>>> understanding is that the use of these tweaks results in a major reduction
>>> in search performance.
>>>
>>
>> Indeed they do, however rewriting them in C to be super-fast would
>> improve this situation. It's primarily a matter of available time and
>> effort. It's also, however, pretty specialized work as you're dealing
>> with Postgres at a very intimate level.
>
> Mike, could you elaborate what bits of code you're talking about here that
> could be rewritten in C?
>

I mean specifically the elaborate COALESCE/NULLIF/regexp (aka ~) parts
of the SELECT clauses that implement the first-word, word-order and
full-phrase relevance bumps that come from search.relevance_adjustment.

There's also the option of attempting to rewrite naco_normalize and
search_normalize (see below) in C. Lots of string mangling to which
Perl is particularly suited, but it's not impossible by any means, and
there are Postgres components (the 'unaccent' contrib/extension, for
instance) that we could probably build on.

> Some of my colleagues at Sitka and I were trying to find out why broad
> searches are unusually slow and eventually found that our adjustments in
> search.relevance_adjustment were slowing things down. Months earlier the CD
> patch was added to trunk to circumvent this problem without our knowledge, so
> we tried backporting that code and testing it however, in our initial tests,
> we weren't entirely satisfied with the CD modifiers' ability to rank items.
>

Right. These are more subtle than the heavy-handed
search.relevance_adjustment settings, and therefore have a less drastic
effect. But they also reduce the need for some of the
search.relevance_adjustment entries, so in combination we should be
able to find a good balance, especially if some of the rel_adjustment
effects can be rewritten in C.

> Doing some digging into the SQL logs and QueryParser.pm, we observed that the
> naco_normalize function appears to be what's slowing the use of
> relevance_adjustment down. While the naco_normalize function itself is quite
> fast on its own, it slows down exponentially when run on many records:
>
> explain analyze select naco_normalize(value) from metabib.keyword_field_entry
> limit 10000;
>
> When using the relevance adjustments, it is run on each metabib.x_entry.value
> that is retrieved in the initial resultset, which in many cases would be
> thousands of records.
> You can adjust the LIMIT in the above query to see how
> it slows down as the result set gets larger. It is also run for each
> relevance_adjustment, however I'm assuming that the query parser is treating
> it properly as IMMUTABLE and only running it once for each adjustment.
>

Indeed, and naco_normalize is not necessarily the only normalizer that
will be applied to each and every field! If you search a class or
field that uses other (pos >= 0) normalizers, all of those will also be
applied to both the column value and the user input.

There's some good news on this front, though. Galen recently
implemented a trimmed down version of naco_normalize, called
search_normalize, that should be a bit faster. That should lower the
total cost by a noticeable amount over many thousands of rows.

> Anyway, not entirely sure about how this analysis holds up in trunk as we've
> done this testing on Postgres 8.4 and Eg 2.0 and it looks like there's new
> code in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to
> those bits.
>
> I've attached some sample SQL of part of a 2.0 query and the same query
> without naco_normalize run on the metabib table. In my testing on our
> production dataset, this query -- a search for "Canada" -- went from over 80
> seconds to less than 10 by removing the naco_normalize (it's still being run
> on the incoming term though which is probably unavoidable)
>

It is unavoidable, but the normalizers should only be run once on the
user input, with the result cached. EXPLAIN will tell the tale, and if
they are being run more than once then the normalizer functions aren't
properly marked STABLE.

Hrm... and looking at your example, I spotted a chance for at least one
optimization. If we recognize that there is only one term in a search
(as in your "canada" example) we can skip the word-order rel_adjustment
even if we're told to apply it, saving ~1/3 of the cost of that
particular chunk.
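
For what it's worth, a quick way to check both of those points from
psql is something like the following. This is only a sketch: it assumes
the normalizers are installed under the names naco_normalize and
search_normalize (whichever of them exist on your version will show
up), and the LIMIT is arbitrary, so adjust it to taste.

```sql
-- How are the normalizers marked? provolatile is
-- 'i' (IMMUTABLE), 's' (STABLE) or 'v' (VOLATILE).
SELECT n.nspname, p.proname, p.provolatile
  FROM pg_catalog.pg_proc p
  JOIN pg_catalog.pg_namespace n ON (n.oid = p.pronamespace)
 WHERE p.proname IN ('naco_normalize', 'search_normalize');

-- Rough per-row cost of the normalizer: compare the runtime of a
-- normalized scan against a plain scan of the same rows.
EXPLAIN ANALYZE
SELECT naco_normalize(value)
  FROM metabib.keyword_field_entry
 LIMIT 10000;

EXPLAIN ANALYZE
SELECT value
  FROM metabib.keyword_field_entry
 LIMIT 10000;
```

The difference between the two EXPLAIN ANALYZE runtimes is, roughly,
what the normalizer costs over that many rows.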

> My thought for a solution would be that we could have naco_normalize run as
> an INSERT trigger on that field. Obviously the whole tables would need to be
> updated which is no small task. I'm also not sure if that would impact other
> things, ie: where else the metabib.x_field_entry.value field is used, but but
> generally I'd think we'd almost always be using that value for a comparison
> of some kind and want that value in a normalized form. Another option may
> be to not normalize in those comparisons, however it's slightly less
> attractive IMO. Anyway I'd be interested to hear your thoughts on that.
>

We need the pre-normalized form for some things, but there are options
(that all have tradeoffs, of course):

 * We could find those things, other than search, for which we use
   m.X_entry.value and move them elsewhere. The tradeoff would be that
   any change in config.metabib_field or normalizer configuration would
   have to cause a rewrite of that column.

 * We could store a normal_value version that is fully normalized, but
   that will nearly double the table size, and that column would still
   have to be rewritten every time a config.metabib_field row or
   normalizer configuration changes. In this case, though, we'd at
   least still have the original value and wouldn't have to go back to
   the MARC.

 * Alternatively, I mentioned off-hand the option of direct indexing.
   The idea is to use an expression index defined by each row in
   config.metabib_field and have a background process keep the indexes
   in sync with configuration as things in that table (and the related
   normalizer configuration, etc) change. I fear that's the path to
   madness, but it would be the most space efficient way to handle
   things.

IMO, since we have a couple angles of attack under the current schema
for optimizing what we're doing, those are the safer course to start
with.
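
To make the direct-indexing idea from the last bullet a bit more
concrete, it would look roughly like this for a single
config.metabib_field row. Sketch only: the field id (15) and the index
name are made up for illustration, and I'm assuming the entry table
carries a "field" column pointing at config.metabib_field.

```sql
-- Hypothetical: one partial expression index per config.metabib_field
-- row, here for an invented field id of 15.
-- Postgres only accepts functions marked IMMUTABLE in an index
-- expression, so this fails unless naco_normalize is marked that way.
CREATE INDEX keyword_field_15_norm_idx
    ON metabib.keyword_field_entry (naco_normalize(value))
 WHERE field = 15;
```

The planner will only consider an index like that when the query uses
the same expression, naco_normalize(value), and a matching condition on
field, and the index would have to be dropped and rebuilt whenever the
normalizer configuration for that field changes, which is where the
background process comes in.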

[Side note: if you don't need the language-based relevance bump
(because, say, the vast majority of your collection is English), remove
the default_preferred_language[_weight] elements from your opensrf.xml
-- you should save a good bit from that alone.]

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [email protected]
 | web: http://www.esilibrary.com
