Here is new version of my patch. There are following changes: 1) I've merged singlebyte and multibyte versions of levenshtein_internal and levenshtein_less_equal_internal using macros and includes. 2) I found that levenshtein takes reasonable time even for long strings. There is an example with strings which contain random combinations of words.
test=# select count(*) from words; count ------- 98569 (1 row) test=# select * from words limit 10; id | word | next_id -------+------------------------------------------------------------------------------------+--------- 200 | Agnew diodes drilled composts Wassermann nonprofit adjusting Chautauqua | 17370 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither | 70983 5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal punctuality | 96685 6103 | G prompt boron overthrew cricking begonia snuffers blazed | 34518 4707 | Djibouti Mari pencilling approves pointing's pashas magpie rams | 71200 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's inhospitable | 28018 9615 | Leos deputy evener yogis assault palatals Octobers surceased | 21878 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier Algeria | 19316 15776 | Sucre battle's misapprehension's predicating nutmegged electrification bowl planks | 65739 16878 | Updike Forster ragout keenly exhalation grayed switch guava's | 43013 (10 rows) Time: 26,125 ms Time: 137,061 ms test=# select avg(length(word)) from words; avg --------------------- 74.6052207083362923 (1 row) Time: 96,415 ms test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher')<=10; id | word | next_id ------+-----------------------------------------------------------------+--------- 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither | 70983 (1 row) Time: 2957,426 ms test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',10)<=10; id | word | next_id ------+-----------------------------------------------------------------+--------- 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither | 70983 (1 row) Time: 84,755 ms It takes O(max(n, m) * max_d / min(ins_c, max_c)) in worst case. That's why I changed limitation of this function to max_d <= 255 * min(ins_c, del_c) 3) I found that there is no need for storing character offsets in levenshtein_less_equal_internal_mb. Instead of this first position in string, where value of distance wasn't greater than max_d, can be stored. 4) I found that there is theoretical maximum distance based of string lengths. It represents the case, when no characters are matching. If theoretical maximum distance is less than given maximum distance we can use theoretical maximum distance. It improves behavior of levenshtein_less_equal with high max_d, but it's still slower than levenshtein: test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',1000)<=10; id | word | next_id ------+-----------------------------------------------------------------+--------- 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither | 70983 (1 row) Time: 4102,731 ms test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher')<=10; id | word | next_id ------+-----------------------------------------------------------------+--------- 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither | 70983 (1 row) Time: 2983,244 ms ---- With best regards, Alexander Korotkov.
fuzzystrmatch-0.5.tar.gz
Description: GNU Zip compressed data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers