On Thu, Oct 10, 2013 at 11:00 PM, Merlin Moncure <mmonc...@gmail.com> wrote: > On Thu, Oct 10, 2013 at 7:12 AM, Heikki Linnakangas > <hlinnakan...@vmware.com> wrote: >> On 10.10.2013 15:03, Fujii Masao wrote: >>> >>> Hi, >>> >>> The behavior of pg_trgm's similarity function seems strange. Is this >>> intentional? >>> >>> I was thinking that the following three calls of the similarity function >>> return >>> the same number because the second argument is just the three characters >>> contained in the first argument in every calls. >>> >>> =# SELECT similarity('12345', '123'); >>> =# SELECT similarity('12345', '234'); >>> =# SELECT similarity('12345', '345'); >>> >>> But that's not true. Each returns the different number. >>> >>> =# SELECT similarity('12345', '123'); >>> similarity >>> ------------ >>> 0.428571 >>> (1 row) >>> >>> =# SELECT similarity('12345', '234'); >>> similarity >>> ------------ >>> 0.111111 >>> (1 row) >>> >>> =# SELECT similarity('12345', '345'); >>> similarity >>> ------------ >>> 0.25 >>> (1 row) >>> >>> This happens because, for example, similarity('12345', '123') returns >>> the similarity number of '**12345*' and '**123*' (* means the blank >>> character), >>> NOT '12345' and '123'. IOW, two and one blank characters are added into >>> the heading and tailing of each argument, respectively. I wonder why >>> pg_trgm's similarity function works in this way. We should change this >>> so that no blank characters are added into the arguments? >> >> >> Well, you could also argue that "111111" and "222222" are quite similar, >> even though pg_trgm's similarity will not think so. It comes down to the >> definition of similarity, and how well that definition matches your >> intuition. >> >> FWIW, it feels right to me that a match in the beginning of a word is worth >> more than one in the middle of a string. -1 on changing that.
Okay, understood. > I'm not so sure that the assumption that leading trigrams should > effectively weight > 3x is a good one to build into the library. > However, the behavior is clearly documented and can't be changed. I > think you'd need to improvise an alternate set of "trigram ops" if you > wanted to rig an alternate matching behavior. Yeah, this makes sense. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers