On Fri, Dec 18, 2015 at 11:43 AM, Artur Zakirov <a.zaki...@postgrespro.ru> wrote: > Hello. > > PostgreSQL has a contrib module named pg_trgm. It is used to the fuzzy text > search. It provides some functions and operators for determining the > similarity of the given texts using trigram matching. > > At the moment, in pg_trgm both the similarity function and the % operator > match two strings expecting that they are similar entirely. But they give > bad results if we want to find documents by a query which is substring of a > document. > > For this purpose we need a new operator which enables to find strings that > have a fragment most similar to the given string. This patch adds some > functions and operator: > - function substring_similarity - Returns a number that indicates how > similar the first string to the most similar substring of the second string. > The range of the result is zero (indicating that the two strings are > completely dissimilar) to one (indicating that the first string is identical > to substring of the second substring).
The behavior of this function is surprising to me. select substring_similarity('dog' , 'hotdogpound') ; substring_similarity ---------------------- 0.25 The most similar substring of the second string would be 'dog', which has a similarity of 1.0 if it were an isolated string. I think that this function should refrain from calculating the space-padded trigrams at the beginning and end of its first argument. Since the first argument is expected to be evaluated as a substring, I wouldn't take it for granted that the beginning and end of the string as given are at the beginning and end of a 'word', which is what is currently being enforced. Or perhaps the 2nd argument needs to have ghost trigrams computed at each start and stop when comparing that substring to the full argument. If we prefer the current behavior, then I think the explanation needs to be changed. Also, should we have a function which indicates the position in the 2nd string at which the most similar match to the 1st argument occurs? select substring_similarity_pos('dog' , 'hotdogpound') ; answering: 4 When the long string is much longer, it can be hard to figure out where the best match was (or the first such match, if there is a series of ties). Finally, should there be operators for returning the distance, analogous to the <-> operator? It is awkward to use an operator for the boolean and a function to get the distance. Especially when you use %> and so need swap the order of arguments between the operator and the function. We could call them <<-> and <->> , where the first corresponds to <% and the second to %> Cheers, Jeff -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers