On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmh...@gmail.com> wrote:
> *scratches head*  Aren't you just moving the same call to a different
> place?
>
So, where can you find this different place? :) In this patch
null-terminated strings are not used at all.

> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.
>
Ok.

> > In these case we should add many checks of max_d in levenshtein_internal
> > function which make code more complex.
>
> When you say "many" checks, how many?
>
> > Actually, we can merge all four functions into one function. But such
> > function will have many checks about multibyte encoding and max_d. So, I
> > see four cases here:
> > 1) one function with checks for multibyte encoding and max_d
> > 2) two functions with checks for multibyte encoding
> > 3) two functions with checks for max_d
> > 4) four separate functions
> > If you prefer case number 3 you should argue your position little more.
>
> I'm somewhat convinced that separating the multibyte case out has a
> performance benefit both by intuition and because you posted some
> numbers, but I haven't seen any argument for separating out the other
> case, so I'm asking if you've checked and whether there is an effect
> and whether it's significant.  The default is always to try to avoid
> maintaining multiple copies of substantially identical code, due to
> the danger that a future patch might fail to update all of them and
> thus introduce a bug.
>
I've tested it with a big value of max_d, and I thought it was evident that
checking for a negative value of max_d would not produce a significant
benefit. Anyway, I tried to add a check for negative max_d to the
levenshtein_less_equal_internal_mb function.
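The function below refers to a CharLengthAndOffset struct and a char_cmp()
helper that aren't included in this excerpt. Roughly, and only as a sketch of
how they are assumed to look (these exact definitions are not taken from the
patch, and the usual postgres.h / string.h includes are assumed), they would
be something like:

/*
 * Assumed definitions, shown here only for readability of the function
 * below; the actual patch may declare them differently.
 */
typedef struct
{
    int         length;     /* character length in bytes */
    int         offset;     /* byte offset of the character in the string */
} CharLengthAndOffset;

/*
 * Compare two multibyte characters given by pointer and byte length.
 * Characters with different byte lengths can never be equal.
 */
static inline bool
char_cmp(const char *s, int s_len, const char *t, int t_len)
{
    if (s_len != t_len)
        return false;
    return memcmp(s, t, s_len) == 0;
}

Caching the byte length and byte offset of every character of s once lets the
inner loop index the string directly instead of calling pg_mblen repeatedly.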
static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
                                   int ins_c, int del_c, int sub_c, int max_d)
{
    int         m, n;
    int        *prev;
    int        *curr;
    int         i, j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int         y_char_len;
    int         curr_left, curr_right, prev_left, prev_right, d;
    int         delta, min_d;

    /*
     * We have to calculate the number of characters for multibyte encodings.
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * We can find the minimal possible distance from the difference of the
     * lengths.
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = -delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for the initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     *
     * For multibyte encodings we'll also store an array with the length and
     * offset of each character of the first string, in order to avoid a
     * great number of pg_mblen calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
            lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;

    /* Initialize the "previous" row to 0..cols */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * There are the following optimizations:
     * 1) Actually, the minimal possible value of the final distance (in the
     * case of all possible matches) is stored in the cells of the matrix. In
     * the case of movement towards the diagonal which contains the last
     * cell, the value should be increased by ins_c + del_c. In the case of
     * movement away from that diagonal, the cell value should not be
     * increased.
     * 2) The range of indexes of the previous row where the stored values
     * are not greater than max_d is [prev_left, prev_right]. So, only the
     * cells connected with this range need to be calculated.
     * For multibyte encodings we should advance the x and y pointers by the
     * results of the pg_mblen function. Also we should use the char_cmp()
     * helper for character comparison.
     */
    for (y = t, j = 1; j < n; y += y_char_len, j++)
    {
        int        *temp;

        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }
        else
        {
            prev_left = 1;
        }

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else
        {
            if (j <= -delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset;
             i < m;
             x += lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;
                if (i <= prev_right)
                {
                    d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);
                }
                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                    y, y_char_len) ? 0 : sub_c), d);
                }
                curr[i] = d;
                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }
            else
            {
                d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
                            curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled, return max_d + 1; otherwise return the
     * final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}

I tested it with the american-english dictionary, which contains 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum
---------
 1074376
(1 row)

Time: 131,435 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from words;
   sum
---------
 1074376
(1 row)

Time: 221,078 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from words;
   sum
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't turn out to be faster than
with just a big value of max_d.

----
With best regards,
Alexander Korotkov.