Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

James Fournie Tue, 06 Mar 2012 15:13:38 -0800

>> 
>> * Giving greater weight to a record if the search terms appear in the title
>> or subject (ideally, we would like these field to be configurable.) This is
>> something that is tweakable in search.relevance_ranking, but my
>> understanding is that the use of these tweaks results in a major reduction
>> in search performance.
>> 
> 
> Indeed they do, however rewriting them in C to be super-fast would
> improve this situation.  It's primarily a matter of available time and
> effort.  It's also, however, pretty specialized work as you're dealing
> with Postgres at a very intimate level.


Mike, could you elaborate what bits of code you're talking about here that 
could be rewritten in C?

Some of my colleagues at Sitka and I were trying to find out why broad searches 
are unusually slow and eventually found that our adjustments in 
search.relevance_adjustment were slowing things down.   Months earlier the CD 
patch was added to trunk to circumvent this problem without our knowledge, so 
we tried backporting that code and testing it however, in our initial tests, we 
weren't entirely satisfied with the CD modifiers' ability to rank items.  

Doing some digging into the SQL logs and QueryParser.pm, we observed that the 
naco_normalize function appears to be what's slowing the use of 
relevance_adjustment down.  While the naco_normalize function itself is quite 
fast on its own, it slows down exponentially when run on many records:

explain analyze select naco_normalize(value) from metabib.keyword_field_entry 
limit 10000;

When using the relevance adjustments, it is run on each metabib.x_entry.value 
that is retrieved in the initial resultset, which in many cases would be 
thousands of records.  You can adjust the LIMIT in the above query to see how 
it slows down as the result set gets larger.  It is also run for each 
relevance_adjustment, however I'm assuming that the query parser is treating it 
properly as IMMUTABLE and only running it once for each adjustment.

Anyway, not entirely sure about how this analysis holds up in trunk as we've 
done this testing on Postgres 8.4 and Eg 2.0 and it looks like there's new code 
in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to those bits. 
 

I've attached some sample SQL of part of a 2.0 query and the same query without 
naco_normalize run on the metabib table.  In my testing on our production 
dataset, this query -- a search for "Canada" -- went from over 80 seconds to 
less than 10 by removing the naco_normalize (it's still being run on the 
incoming term though which is probably unavoidable)

My thought for a solution would be that we could have naco_normalize run as an 
INSERT trigger on that field.  Obviously the whole tables would need to be 
updated which is no small task.  I'm also not sure if that would impact other 
things, ie: where else the metabib.x_field_entry.value field is used, but but 
generally I'd think we'd almost always be using that value for a comparison of 
some kind and want that value in a normalized form.   Another option may be to 
not normalize in those comparisons, however it's slightly less attractive IMO. 
Anyway I'd be interested to hear your thoughts on that.

-- normal EG 2.0 query
EXPLAIN ANALYZE
 SELECT  * /* bib search */
           FROM  search.query_parser_fts(
                     1::INT,
                     0::INT,
                     $core_query_25078$SELECT  m.source AS id,
         ARRAY_ACCUM(DISTINCT m.source) AS records,
         (AVG(
   (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * 
x7b52820_keyword.weight
      *  /* word_order */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
(naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
       *  /* first_word */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
       *  /* full_match */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE 
)::INT * 5, 1))::NUMERIC AS rel,
          (AVG(
    (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * 
x7b52820_keyword.weight
       *  /* word_order */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
(naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
       *  /* first_word */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
       *  /* full_match */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE 
)::INT * 5, 1))::NUMERIC AS rank, 
          FIRST(mrd.date1) AS tie_break
    FROM  metabib.metarecord_source_map m
          JOIN metabib.rec_descriptor mrd ON (m.source = mrd.record)
          
   LEFT JOIN (
    SELECT fe.*, fe_weight.weight, x.tsq /* search */
      FROM  metabib.keyword_field_entry AS fe
     JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
     JOIN (SELECT 
     to_tsquery('keyword', COALESCE(NULLIF( '(' || 
btrim(regexp_replace(split_date_range(naco_normalize($_25078$canada$_25078$)),E'(?:\\s+|:)','&','g'),'&|')
  || ')', '()'), '')) AS tsq ) AS x ON (fe.index_vector @@ x.tsq)
   ) AS x7b52820_keyword ON (m.source = x7b52820_keyword.source)
    WHERE 1=1
          AND ((x7b52820_keyword.id IS NOT NULL))
    GROUP BY 1
    ORDER BY 4 DESC NULLS LAST, 5 DESC NULLS LAST, 3 DESC
    LIMIT 10000
  $core_query_25078$::TEXT,
                      $${}$$::INT[],
                      $${}$$::INT[],
                      NULL::INT,
                      1000::INT,
                      10000::INT,
                      'f'::BOOL,
                      'f'::BOOL
                  );

-- removed the naco_normalize on the metabib table 
EXPLAIN ANALYZE
 SELECT  * /* bib search */
           FROM  search.query_parser_fts(
                     1::INT,
                     0::INT,
                     $core_query_25078$SELECT  m.source AS id,
         ARRAY_ACCUM(DISTINCT m.source) AS records,
         (AVG(
   (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * 
x7b52820_keyword.weight
      *  /* word_order */ COALESCE(NULLIF( ((x7b52820_keyword.value) ~ 
(naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
       *  /* first_word */ COALESCE(NULLIF( ((x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
       *  /* full_match */ COALESCE(NULLIF( ((x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE 
)::INT * 5, 1))::NUMERIC AS rel,
          (AVG(
    (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * 
x7b52820_keyword.weight
       *  /* word_order */ COALESCE(NULLIF( ((x7b52820_keyword.value) ~ 
(naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
       *  /* first_word */ COALESCE(NULLIF( ((x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
       *  /* full_match */ COALESCE(NULLIF( ((x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE 
)::INT * 5, 1))::NUMERIC AS rank, 
          FIRST(mrd.date1) AS tie_break
    FROM  metabib.metarecord_source_map m
          JOIN metabib.rec_descriptor mrd ON (m.source = mrd.record)
          
   LEFT JOIN (
    SELECT fe.*, fe_weight.weight, x.tsq /* search */
      FROM  metabib.keyword_field_entry AS fe
     JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
     JOIN (SELECT 
     to_tsquery('keyword', COALESCE(NULLIF( '(' || 
btrim(regexp_replace(split_date_range(naco_normalize($_25078$canada$_25078$)),E'(?:\\s+|:)','&','g'),'&|')
  || ')', '()'), '')) AS tsq ) AS x ON (fe.index_vector @@ x.tsq)
   ) AS x7b52820_keyword ON (m.source = x7b52820_keyword.source)
    WHERE 1=1
          AND ((x7b52820_keyword.id IS NOT NULL))
    GROUP BY 1
    ORDER BY 4 DESC NULLS LAST, 5 DESC NULLS LAST, 3 DESC
    LIMIT 10000
  $core_query_25078$::TEXT,
                      $${}$$::INT[],
                      $${}$$::INT[],
                      NULL::INT,
                      1000::INT,
                      10000::INT,
                      'f'::BOOL,
                      'f'::BOOL
                  );



~James Fournie
BC Libraries Cooperative

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Reply via email to