Re: [HACKERS] multibyte charater set in levenshtein function

Itagaki Takahiro Sun, 11 Jul 2010 21:46:53 -0700

Hi, I'm reviewing "Multibyte charater set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304


The main logic seems to be good, but I have some comments about
the coding style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
  similar. Can you merge the code? We can always use less_equal_internal()
  if the overhead is ignorable. Did you compare them?

* There are many "if (!multibyte_encoding)" in levenshtein_internal().
  How about split the function into two funcs for single-byte chars and
  multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
  use multi-byte version if the overhead is small.

* I prefer a struct rather than an array.  "4 * m" and "3 * m" look like magic
  numbers for me. Could you name the entries with definition of a struct?
    /*
     * For multibyte encoding we'll also store array of lengths of
     * characters and array with character offsets in first string
     * in order to avoid great number of pg_mblen calls.
     */
    prev = (int *) palloc(4 * m * sizeof(int));

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.

* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encoding!  We should
mention which function supports mbchars not to confuse users.

* (Not an issue for the patch, but...)
  Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
  PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
  Unpacking versions make the core a bit faster.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multibyte charater set in levenshtein function

Reply via email to