2010/8/2 Alexander Korotkov <aekorot...@gmail.com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it. In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch. This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed. It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
>
> I'm sorry. This is my first patch; I will be more accurate next time. But I
> think there is no unified opinion about some changes, like replacing "!m"
> with "m == 0".
Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters). It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present. It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases. It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster
> than mine when multi-byte characters are present? My tests show the
> reverse. For example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';
SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran in about 102 ms, 222 ms,
and 129 ms. With your patch, I get about 134 ms, 222 ms, and 130 ms.
Perhaps this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte words
in it. Can you send me a file?

> I think the cause of the slowdown is the memcmp function. This function
> is very fast for long stretches of memory, but for a few bytes an inline
> function is faster, because the compiler has more freedom for
> optimization. After replacing memcmp with an inline function like the
> following in your implementation:

Yeah, that does seem to be slightly better. I've done a version of this
in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
mb_levenshtein_rmh-v2.patch
Description: Binary data
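
[Editor's note] The condition Alexander flagged above decides whether the
slower multi-byte code path is taken. Judging from the variable names, m
and n are character counts and s_bytes/t_bytes are byte counts, so the
multi-byte path is needed as soon as either string contains a multi-byte
character. A minimal sketch of the corrected test, using a hypothetical
helper name (the patch itself is attached above, not reproduced here):

    #include <stdbool.h>

    /*
     * Hypothetical sketch, not the actual patch.  m and n are the character
     * lengths of the two inputs; s_bytes and t_bytes are their byte lengths.
     * A string is pure single-byte only when its character count equals its
     * byte count, so the multi-byte path is required as soon as EITHER
     * comparison fails -- i.e. the test needs ||, not the && quoted above.
     */
    static inline bool
    needs_multibyte_path(int m, int s_bytes, int n, int t_bytes)
    {
        return (m != s_bytes || n != t_bytes);
    }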
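[Editor's note] Alexander's inline replacement for memcmp() is not
reproduced in the archived message. Assuming the usual shape of such a
helper (name and signature illustrative), the idea is roughly the
following: for the one-to-four-byte characters compared in the inner
loop, an open-coded byte loop avoids memcmp()'s call overhead and is
easier for the compiler to inline and unroll.

    #include <stdbool.h>

    /*
     * Illustrative sketch only -- the real helper ships in the attached
     * patch.  Compares len bytes and reports whether they are identical;
     * for very short lengths the compiler can inline and unroll this,
     * whereas memcmp() typically incurs a library call.
     */
    static inline bool
    char_same(const char *s1, const char *s2, int len)
    {
        while (len-- > 0)
        {
            if (*s1++ != *s2++)
                return false;
        }
        return true;
    }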