On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmh...@gmail.com> wrote:
> *scratches head*  Aren't you just moving the same call to a different
> place?
>
So, where can you find this different place? :) In this patch
null-terminated strings are not used at all.

> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.
>
Ok.

> > In these case we should add many checks of max_d in levenshtein_internal
> > function which make code more complex.
>
> When you say "many" checks, how many?
>
> > Actually, we can merge all four functions into one function. But such
> > function will have many checks about multibyte encoding and max_d. So, I
> > see four cases here:
> > 1) one function with checks for multibyte encoding and max_d
> > 2) two functions with checks for multibyte encoding
> > 3) two functions with checks for max_d
> > 4) four separate functions
> > If you prefer case number 3 you should argue your position little more.
>
> I'm somewhat convinced that separating the multibyte case out has a
> performance benefit both by intuition and because you posted some
> numbers, but I haven't seen any argument for separating out the other
> case, so I'm asking if you've checked and whether there is an effect
> and whether it's significant.  The default is always to try to avoid
> maintaining multiple copies of substantially identical code, due to
> the danger that a future patch might fail to update all of them and
> thus introduce a bug.
>
I've tested it with a big value of max_d, and I thought it was evident that
checking for a negative value of max_d would not produce a significant
benefit. Anyway, I tried to add a check for negative max_d to the
levenshtein_less_equal_internal_mb function.
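The function below refers to a CharLengthAndOffset struct and a char_cmp()
helper that aren't included in this excerpt. Roughly, and only as a sketch of
how they are assumed to look (these exact definitions are not taken from the
patch, and the usual postgres.h / string.h includes are assumed), they would
be something like:

/*
 * Assumed definitions, shown here only for readability of the function
 * below; the actual patch may declare them differently.
 */
typedef struct
{
    int         length;     /* character length in bytes */
    int         offset;     /* byte offset of the character in the string */
} CharLengthAndOffset;

/*
 * Compare two multibyte characters given by pointer and byte length.
 * Characters with different byte lengths can never be equal.
 */
static inline bool
char_cmp(const char *s, int s_len, const char *t, int t_len)
{
    if (s_len != t_len)
        return false;
    return memcmp(s, t, s_len) == 0;
}

Caching the byte length and byte offset of every character of s once lets the
inner loop index the string directly instead of calling pg_mblen repeatedly.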
static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
                                   int ins_c, int del_c, int sub_c, int max_d)
{
    int         m, n;
    int        *prev;
    int        *curr;
    int         i, j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int         y_char_len;
    int         curr_left, curr_right, prev_left, prev_right, d;
    int         delta, min_d;

    /*
     * We have to calculate the number of characters for multibyte encodings.
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * We can find the minimal possible distance from the difference of the
     * lengths.
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = -delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for the initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     *
     * For multibyte encodings we'll also store an array with the length and
     * offset of each character of the first string, in order to avoid a
     * great number of pg_mblen calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
            lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;

    /* Initialize the "previous" row to 0..cols */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * There are the following optimizations:
     * 1) Actually, the minimal possible value of the final distance (in the
     * case of all possible matches) is stored in the cells of the matrix. In
     * the case of movement towards the diagonal which contains the last
     * cell, the value should be increased by ins_c + del_c. In the case of
     * movement away from that diagonal, the cell value should not be
     * increased.
     * 2) The range of indexes of the previous row where the stored values
     * are not greater than max_d is [prev_left, prev_right]. So, only the
     * cells connected with this range need to be calculated.
     * For multibyte encodings we should advance the x and y pointers by the
     * results of the pg_mblen function. Also we should use the char_cmp()
     * helper for character comparison.
     */
    for (y = t, j = 1; j < n; y += y_char_len, j++)
    {
        int        *temp;

        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }
        else
        {
            prev_left = 1;
        }

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else
        {
            if (j <= -delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset;
             i < m;
             x += lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;
                if (i <= prev_right)
                {
                    d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);
                }
                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                    y, y_char_len) ? 0 : sub_c), d);
                }
                curr[i] = d;
                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }
            else
            {
                d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
                            curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled, return max_d + 1; otherwise return the
     * final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}

I tested it with the american-english dictionary, which contains 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum
---------
 1074376
(1 row)

Time: 131,435 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from words;
   sum
---------
 1074376
(1 row)

Time: 221,078 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from words;
   sum
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't turn out to be faster than
with just a big value of max_d.

----
With best regards,
Alexander Korotkov.