Hackers,

I'm investigating the bug report [1] about the behavior of websearch_to_tsquery() with quotes and multi-lexeme tokens. See the example below.

# select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"');
 ?column?
----------
 f

So, the tsvector doesn't match the tsquery, even though exactly the same text was passed to to_tsvector() and placed inside the quotes of websearch_to_tsquery(). Looks wrong to me. Let's examine the output of to_tsvector() and websearch_to_tsquery().

# select to_tsvector('pg_class foo');
       to_tsvector
--------------------------
 'class':2 'foo':3 'pg':1

# select websearch_to_tsquery('"pg_class foo"');
     websearch_to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'
(1 row)

So, the 'pg_class' token was split into two lexemes, 'pg' and 'class'. But the output of websearch_to_tsquery() connects 'pg' and 'class' with the & operator. The tsquery expects 'pg' and 'class' to both be neighbors of 'foo'. So, 'pg' and 'class' are expected to share the same position, and that isn't true for the tsvector, where they are at positions 1 and 2.

Let's see how phraseto_tsquery() handles that.

# select to_tsvector('pg_class foo') @@ phraseto_tsquery('pg_class foo');
 ?column?
----------
 t

# select phraseto_tsquery('pg_class foo');
      phraseto_tsquery
----------------------------
 'pg' <-> 'class' <-> 'foo'

phraseto_tsquery() connects all the lexemes with phrase operators and everything works OK. For me it's obvious that phraseto_tsquery() and websearch_to_tsquery() with quotes should work the same way.

Notably, the current behavior of websearch_to_tsquery() is recorded in the regression tests. So, it might look like this behavior is intended, but it's too ridiculous, and I think the regression tests contain an oversight as well.

I've prepared a fix which doesn't break the fts parser abstractions too much (attached patch), but I've faced another similar issue in to_tsquery().

# select to_tsvector('pg_class foo') @@ to_tsquery('pg_class <-> foo');
 ?column?
----------
 f

# select to_tsquery('pg_class <-> foo');
          to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'

I think if a user writes 'pg_class <-> foo', then it's expected to match 'pg_class foo' regardless of which lexemes 'pg_class' is split into.

This issue looks like a much more complex design bug in phrase search. Fixing this would require some kind of readahead or multipass processing, because we don't know in advance how to process 'pg_class'.

Is this really a design bug that has existed in phrase search from the beginning? Or am I missing something?

Links
1. https://www.postgresql.org/message-id/16592-70b110ff9731c07d%40postgresql.org

------
Regards,
Alexander Korotkov
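
To make the position argument above concrete, here is a toy model in Python. It is only a sketch of the matching rule as described in this message (lexemes joined by & inside a phrase must share a position, and a <-> b requires b exactly one position after a); it is not PostgreSQL's actual implementation, and the names positions(), matches(), current and expected are made up for illustration.

```python
# Toy model, NOT PostgreSQL code: a tsvector is a dict mapping each lexeme
# to its set of positions; a query is a nested tuple:
#   ("lex", word), ("and", a, b) or ("phrase", a, b)

def positions(query, vector):
    """Return the set of positions at which `query` matches in this model."""
    kind = query[0]
    if kind == "lex":
        return set(vector.get(query[1], ()))
    left = positions(query[1], vector)
    right = positions(query[2], vector)
    if kind == "and":
        # Inside a phrase, both operands must sit at the same position.
        return left & right
    if kind == "phrase":
        # a <-> b: b must appear exactly one position after a.
        return {p for p in right if p - 1 in left}
    raise ValueError(f"unknown node kind: {kind!r}")


def matches(query, vector):
    """True if the query matches at one or more positions."""
    return bool(positions(query, vector))


# 'class':2 'foo':3 'pg':1, as produced by to_tsvector('pg_class foo')
vector = {"pg": {1}, "class": {2}, "foo": {3}}

# ( 'pg' & 'class' ) <-> 'foo'   (what websearch_to_tsquery() produces today)
current = ("phrase", ("and", ("lex", "pg"), ("lex", "class")), ("lex", "foo"))

# 'pg' <-> 'class' <-> 'foo'     (what phraseto_tsquery() produces)
expected = ("phrase", ("phrase", ("lex", "pg"), ("lex", "class")), ("lex", "foo"))

print(matches(current, vector))   # False: 'pg' and 'class' never share a position
print(matches(expected, vector))  # True: positions 1, 2, 3 are consecutive
```

In this model the first query can only match a vector where 'pg' and 'class' are stored at the same position, which to_tsvector() does not produce for 'pg_class'.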
websearch_fix_p2.patch
Description: Binary data