On Wed, Mar 7, 2012 at 2:28 AM, Dan Scott <[email protected]> wrote:
> Lots of <snip>s implied below; also note that I'm running James' 2.0
> query on a 2.1 system.
>
> On Tue, Mar 06, 2012 at 10:55:24PM -0500, Mike Rylander wrote:
>> On Tue, Mar 6, 2012 at 6:13 PM, James Fournie
>> <[email protected]> wrote:
>> >>>
>> >>> * Giving greater weight to a record if the search terms appear in the
>> >>> title or subject (ideally, we would like these fields to be
>> >>> configurable.) This is something that is tweakable in
>> >>> search.relevance_ranking, but my understanding is that the use of
>> >>> these tweaks results in a major reduction in search performance.
>> >>>
>> >>
>> >> Indeed they do, however rewriting them in C to be super-fast would
>> >> improve this situation. It's primarily a matter of available time and
>> >> effort. It's also, however, pretty specialized work as you're dealing
>> >> with Postgres at a very intimate level.
>
> Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
> improvement, but it looks like another bottleneck for broad searches is
> in having to visit & sort hundreds of thousands of rows, so that they
> can be sorted by rank, with the added I/O cost of using a disk merge for
> these broad searches rather than in-memory quicksort.
>
> For comparison, I swapped out 'canada' for 'paraguay' and explain
> analyzed the results; 'canada' uses a disk merge because it needs to
> deal with 482,000 rows of data and sort 596,000 KB of data, while
> 'paraguay' (which only has to sort 322 rows) used an in-memory quicksort
> at 582 KB.
>
> This is on a system where work_mem is set to 288 MB - much higher than
> one would generally want, particularly for the number of physical
> connections that could potentially get ramped up. That high work_mem
> helps with reasonably broad searches, but searching for "Canada" in a
> Canadian academic library, you might as well be searching for "the"...
>
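For anyone following along, the arithmetic behind those two sorts can be sketched roughly as below. This is only a back-of-the-envelope check using the numbers quoted above; the function name is made up for illustration, and Postgres actually decides based on the in-memory size of the tuples (usually larger than the on-disk figure that EXPLAIN ANALYZE reports for an external sort), so treat it as an approximation.

```python
# Rough sketch: compare the sort sizes quoted in the thread with work_mem.
# Assumption: the reported KB figure is a fair stand-in for the memory the
# sort would need; in practice the in-memory footprint tends to be larger.

KB_PER_MB = 1024


def fits_in_work_mem(sort_kb: int, work_mem_mb: int) -> bool:
    """Return True if a sort of sort_kb kilobytes fits in work_mem_mb megabytes."""
    return sort_kb <= work_mem_mb * KB_PER_MB


WORK_MEM_MB = 288  # work_mem value quoted in the thread

# term -> (rows sorted, sort size in KB), as quoted in the thread
searches = {"canada": (482_000, 596_000), "paraguay": (322, 582)}

for term, (rows, sort_kb) in searches.items():
    if fits_in_work_mem(sort_kb, WORK_MEM_MB):
        method = "in-memory quicksort"
    else:
        method = "external (disk) merge"
    print(f"{term}: {rows} rows, {sort_kb} KB -> {method}")
```

With work_mem at 288 MB (294,912 KB), the 'paraguay' sort fits easily, while the 'canada' sort is roughly twice the limit, which is consistent with the disk merge reported above.
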
All true, but also not something we can do much about (without a
precalculated rank, a la PageRank); also, testing on 9.0 around its
release shows that pre-limiting as we used to was slower than what we
do today. I don't have the details in front of me, but there it is.

>
>> Indeed, and naco_normalize is not necessarily the only normalizer that
>> will be applied to each and every field! If you search a class or
>> field that uses other (pos >= 0) normalizers, all of those will also
>> be applied to both the column value and the user input.
>>
>> There's some good news on this front, though. Galen recently
>> implemented a trimmed down version of naco_normalize, called
>> search_normalize, that should be a bit faster. That should lower the
>> total cost by a noticeable amount over many thousands of rows.
>
> You might be thinking of something else? I'm pretty sure that
> 2bc4e97f72b shows that I implemented search_normalize() simply to avoid
> problems with apostrophe mangling in the strict naco_normalize()
> function - and I doubt there will be any observable difference in
> performance.

Sorry, I probably am. My apologies, Dan, I didn't intend to misdirect
your credit. That said, I'd be surprised if anything that shortened
the pl/perl we use didn't help some, in aggregate, on very large
queries. It's testable...

>
>> Hrm... and looking at your example, I spotted a chance for at least
>> one optimization. If we recognize that there is only one term in a
>> search (as in your "canada" example) we can skip the word-order
>> rel_adjustment if we're told to apply it, saving ~1/3 of the cost of
>> that particular chunk.
>
> I can confirm this; running the same query on our system with word order
> removed carved response times down to 390 seconds from 580 seconds.
> Still unusable, but better. (EXPLAIN ANALYZE of the inner query
> attached).
>
>> * Alternatively, I mentioned off-hand the option of direct indexing.
>> The idea is to use an expression index defined by each row in
>> config.metabib_field and have a background process keep the indexes in
>> sync with configuration as things in that table (and the related
>> normalizer configuration, etc) change. I fear that's the path to
>> madness, but it would be the most space efficient way to handle
>> things.
>
> I don't think that's the path to madness; it appeals to me, at least.
> (Okay, it's probably insane then.)
>

In a broad sense it appeals to me, too. You and I were the ones who
discussed this in the long-long-ago, IYR. It's when I start digging
into the details of implementation and the implications for
config-change-based thrashing that I start going a little mad ...

OH! Ranking via ts_rank[_cd]. We can't do it without the tsvector in
hand. But if we can work around that somehow (a table that stores
only that value, reducing the sort size you mention above?), it's not
impossible.

>> [Side note: if you don't need the language-based relevance bump
>> (because, say, the vast majority of your collection is English),
>> remove the default_preferred_language[_weight] elements from your
>> opensrf.xml -- you should save a good bit from that alone.]
>
> You may also want to remove that if you have a collection with an evenly
> distributed mix of languages and a corresponding user base. With our
> bilingual population & collection, and languages (English & French) that
> share a lot of similar roots (particularly when an English stemming
> algorithm is applied), the added bump for English was quite disturbing
> for francophones & quickly disabled.
>

Good point!

> Also, it appears that removing the pertinent clause didn't affect
> response times at all. But it's 2:30 am, so I should stop testing and
> trying to draw conclusions at this point!

That should be looked into. Removing the weight setting, in
particular, should have caused the whole language check to go away.
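
Coming back to the single-term shortcut discussed earlier in this thread, the check itself could look something like the sketch below. To be clear, this is not the actual query-building code; the function name and the adjustment names ("word_order", "first_word", "full_match") are assumptions used purely for illustration.

```python
# Illustrative sketch only: names here are hypothetical, not the real
# query-building code. The idea is that a one-term search has no word
# order to compare, so the word-order adjustment can be dropped up front.

def effective_adjustments(terms, requested):
    """Return the adjustments worth applying for this set of search terms."""
    if len(terms) <= 1:
        # One term (or none): skip the word-order adjustment entirely.
        return [adj for adj in requested if adj != "word_order"]
    # Two or more terms: apply everything that was requested.
    return list(requested)


print(effective_adjustments(["canada"], ["first_word", "word_order", "full_match"]))
print(effective_adjustments(["canada", "history"], ["first_word", "word_order"]))
```

The point is simply to make the decision once, before the query is built, so that the per-row cost of the word-order adjustment is never paid for single-term searches.
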

--
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [email protected]
 | web: http://www.esilibrary.com
