Hi!

> * levenshtein_internal() and levenshtein_less_equal_internal() are very
>  similar. Can you merge the code? We can always use less_equal_internal()
>  if the overhead is ignorable. Did you compare them?
>
With a big value of max_d the overhead is significant. Here is an example on
the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
---------
 1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
   sum
---------
 1386456
(1 row)

Time: 317,821 ms
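For illustration, here is a minimal single-byte Levenshtein sketch with a
max_d cutoff (my own simplified code, not the patch's; the function name is
assumed). The extra row-minimum bookkeeping needed for the cutoff is exactly
the kind of per-cell work that shows up as overhead when max_d is large and
the early exit never fires:

```c
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance with an optional cutoff: if the distance is
 * certain to exceed max_d, return max_d + 1 early. Pass max_d < 0 to
 * disable the cutoff. */
static int
levenshtein_limit(const char *s, const char *t, int max_d)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev = malloc((m + 1) * sizeof(int));
    int    *curr = malloc((m + 1) * sizeof(int));
    int     i, j, result;

    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (j = 1; j <= n; j++)
    {
        int     row_min;

        curr[0] = j;
        row_min = j;
        for (i = 1; i <= m; i++)
        {
            int     ins = curr[i - 1] + 1;
            int     del = prev[i] + 1;
            int     sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            int     d = ins < del ? ins : del;

            if (sub < d)
                d = sub;
            curr[i] = d;
            if (d < row_min)
                row_min = d;    /* extra bookkeeping only the cutoff needs */
        }
        if (max_d >= 0 && row_min > max_d)
        {
            /* every cell in this row exceeds max_d; it can only grow */
            free(prev);
            free(curr);
            return max_d + 1;
        }
        /* swap the two rows by pointer; no data is copied */
        {
            int    *temp = prev;

            prev = curr;
            curr = temp;
        }
    }
    result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

When max_d is large (as with 100 in the query above), the cutoff almost never
triggers, so the bookkeeping is pure cost.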


> * There are many "if (!multibyte_encoding)" in levenshtein_internal().
>  How about split the function into two funcs for single-byte chars and
>  multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
>  use multi-byte version if the overhead is small.
>
The multi-byte version was about 4 times slower. But I have rewritten my
CHAR_CMP macro as an inline function, and now it is only about 1.5 times
slower.
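The macro-to-inline-function change can be sketched like this (a hedged
illustration only; the name and exact shape of the patch's CHAR_CMP are
assumed). An inline function gives the compiler a single typed body to
optimize and lets the cheap first-byte check short-circuit before memcmp:

```c
#include <stdbool.h>
#include <string.h>

/* Compare two possibly multi-byte characters given as (pointer, byte
 * length) pairs. Inline function replacing a macro: same semantics,
 * but with real parameter types and no multiple-evaluation hazards. */
static inline bool
char_cmp(const char *a, int a_len, const char *b, int b_len)
{
    if (a_len != b_len)
        return false;           /* different byte lengths can't match */
    if (*a != *b)
        return false;           /* cheap first-byte fast path */
    return memcmp(a, b, a_len) == 0;
}
```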

In a database with multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway I think that overhead is not ignorable. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.


> * I prefer a struct rather than an array.  "4 * m" and "3 * m" look like
>  magic numbers for me. Could you name the entries with definition of a
>  struct?
>    /*
>     * For multibyte encoding we'll also store array of lengths of
>     * characters and array with character offsets in first string
>     * in order to avoid great number of pg_mblen calls.
>     */
>    prev = (int *) palloc(4 * m * sizeof(int));
>
In this line of code, memory is allocated for 4 arrays: prev, curr,
offsets, char_lens. So I have joined offsets and char_lens into a struct.
But I can't join prev and curr because of this trick:
        temp = curr;
        curr = prev;
        prev = temp;
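The two points above can be sketched together (field and function names are
mine, not taken from the patch): the per-character offset and length arrays
merge naturally into one struct array, while prev and curr must remain
separate pointers because the swap exchanges the pointers themselves:

```c
/* One struct entry per character of the first string, replacing the two
 * parallel int arrays that made "4 * m" look like a magic number. */
typedef struct
{
    int     offset;     /* byte offset of the character in the string */
    int     char_len;   /* byte length of the (possibly multi-byte) char */
} CharInfo;

/* The row swap that prevents merging prev and curr into one struct:
 * after each row of the DP, only the pointers change hands. */
static void
swap_rows(int **prev, int **curr)
{
    int    *temp = *curr;

    *curr = *prev;
    *prev = temp;
}
```

Folding prev and curr into a single struct would force copying a whole row
per iteration instead of this O(1) pointer exchange.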

> * There are some compiler warnings. Avoid them if possible.
> fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
> fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
> this function
> fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
> this function
> fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
> in this function
> fuzzystrmatch.c: In function ‘levenshtein_internal’:
> fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
> this function
>
Fixed.

> * Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
> of 'm' is an integer.
>
Fixed.


> * Need to fix the caution in docs.
> http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
> | Caution: At present, fuzzystrmatch does not work well with
> | multi-byte encodings (such as UTF-8).
> but now levenshtein supports multi-byte encoding!  We should
> mention which function supports mbchars not to confuse users.
>
I've updated this caution. Also I've added documentation for the
levenshtein_less_equal function.

> * (Not an issue for the patch, but...)
>  Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
>  PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
>  Unpacking versions make the core a bit faster.
>
Fixed.
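For reference, the getter change looks roughly like this (fragment only;
these macros come from PostgreSQL's fmgr and varlena headers):

```c
/* Before: PG_GETARG_TEXT_P detoasts into a full, aligned varlena copy. */
text   *src  = PG_GETARG_TEXT_P(0);
char   *data = VARDATA(src);
int     len  = VARSIZE(src) - VARHDRSZ;

/* After: the _PP/_ANY forms also accept packed (short-header) values,
 * avoiding an extra copy in the common case. */
text   *src2  = PG_GETARG_TEXT_PP(0);
char   *data2 = VARDATA_ANY(src2);
int     len2  = VARSIZE_ANY_EXHDR(src2);
```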

With best regards,
Alexander Korotkov.

Attachment: fuzzystrmatch-0.4.diff.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
