[HACKERS] like/ilike improvements

Andrew Dunstan Tue, 22 May 2007 09:01:48 -0700

Starting from a review of a patch from Itagaki Takahiro to improve LIKE performance for UTF8-encoded databases, I have been working on improving both the efficiency of the LIKE/ILIKE code and the code quality.

The main efficiency improvement comes from some fairly tricky analysis and discussion on -patches. Essentially there are two calls that we make to advance the text and pattern cursors: NextByte and NextChar. In the case of single byte charsets these are in fact the same thing, but in multi byte charsets they are obviously not, and in that case NextChar is a lot more expensive. It turns out (according to the analysis) that the only time we actually need to use NextChar is when we are matching an "_" in a like/ilike pattern. It also turns out that there are some comparison tests that we can hoist out of a loop and thus avoid repeating over and over. Also, some calls can be marked "inline" to improve efficiency. Finally, the special case of computing lower(x) on the fly for ILIKE comparisons on single byte charset strings turns out to have the potential to call lower() O(n^2) times, so it has been removed and we now treat foo ILIKE bar as lower(foo) LIKE lower(bar) for all charsets uniformly. There will be cases where this approach wins and cases where it loses, but the wins are potentially dramatic, whereas the losses should be mild.
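
To make the NextByte/NextChar distinction concrete, here is a minimal sketch in Python. It is not the patch code: the names char_len, like_match and ilike_match are made up for illustration, the input is assumed to be valid UTF-8, and backslash escapes are ignored. Stepping by bytes is safe for literals because a UTF-8 continuation byte (0x80-0xBF) can never equal a lead byte, so a literal from the pattern can only match at a character boundary. Only "_" needs to know how long the current character is.

```python
def char_len(lead: int) -> int:
    """Byte length of the UTF-8 character that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2  # lead bytes 0xC0..0xDF


def like_match(text: bytes, pat: bytes, t: int = 0, p: int = 0) -> bool:
    """Match UTF-8 text against a LIKE pattern (no escape handling)."""
    while p < len(pat):
        c = pat[p]
        if c == ord("%"):
            # Consecutive %'s are equivalent to a single %.
            while p < len(pat) and pat[p] == ord("%"):
                p += 1
            if p == len(pat):
                return True  # trailing % matches whatever is left
            if pat[p] == ord("_"):
                # "%_": retry at each character boundary ("NextChar").
                while t < len(text):
                    if like_match(text, pat, t, p):
                        return True
                    t += char_len(text[t])
                return False
            # Literal after %: scan byte by byte ("NextByte") and only
            # recurse where the first pattern byte matches. That test is
            # hoisted out of the recursive call.
            first = pat[p]
            while t < len(text):
                if text[t] == first and like_match(text, pat, t, p):
                    return True
                t += 1
            return False
        if t >= len(text):
            return False  # pattern still needs input, text is exhausted
        if c == ord("_"):
            t += char_len(text[t])  # the one place a whole character is skipped
        elif text[t] != c:
            return False
        else:
            t += 1  # literal bytes are compared one at a time
        if c == ord("_") or True:
            p += 1
    return t == len(text)


def ilike_match(text: str, pat: str) -> bool:
    """ILIKE as lower(text) LIKE lower(pat): each side is lowered once."""
    # str.lower() stands in for the database's lower(); it is an assumption
    # of this sketch and not locale-equivalent.
    return like_match(text.lower().encode("utf-8"), pat.lower().encode("utf-8"))
```

Lowering both strings once up front costs one pass over each, whereas lowering on the fly inside the match can redo the work for the same text bytes every time "%" backtracks.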

The current state of this work is at http://archives.postgresql.org/pgsql-patches/2007-05/msg00385.php

I've been testing it using a set of 5m rows of random Latin1 data - each row is between 100 and 400 chars long, and 20% of them (roughly) have the string "foo" randomly located within them. The test platform is gcc/fc6/AMD64.
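
For anyone who wants to build a comparable data set, the following is a rough sketch of a generator matching the description above. It is not the script actually used for these tests: the function names, the seed, and the choice of printable Latin1 characters only are assumptions made for illustration.

```python
import random

# Printable Latin1 code points: 0x20-0x7E and 0xA0-0xFF (assumed alphabet).
LATIN1_PRINTABLE = [chr(c) for c in range(0x20, 0x7F)] + \
                   [chr(c) for c in range(0xA0, 0x100)]


def make_row(rng, needle="foo", hit_rate=0.2, min_len=100, max_len=400):
    """Return one random row; about hit_rate of rows contain the needle."""
    n = rng.randint(min_len, max_len)
    chars = rng.choices(LATIN1_PRINTABLE, k=n)
    if rng.random() < hit_rate:
        # Overwrite rather than insert, so the row length stays in range.
        pos = rng.randrange(n - len(needle) + 1)
        chars[pos:pos + len(needle)] = needle
    return "".join(chars)


def make_rows(count, seed=0):
    """Generate count rows reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_row(rng) for _ in range(count)]
```

Each row can be encoded with row.encode("latin-1") before loading; any escaping needed by the load method (for example for backslashes) is left out of this sketch.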

I have loaded the data into both Latin1 and UTF8 encoded databases. (I'm not sure if there are other multibyte charsets that are compatible with Latin1 client encoding.) My test is essentially:


 select count(*) from footable where t like '%_foo%';
 select count(*) from footable where t ilike '%_foo%';

 select count(*) from footable where t like '%foo%';
 select count(*) from footable where t ilike '%foo%';

Note that the "%_" case is probably the worst for these changes, since it involves lots of calls to NextChar() (see above).

The multibyte results show significant improvement. The results are about flat or a slight improvement for the singlebyte cases. I'll post some numbers on this shortly.

But before I commit this I'd appreciate seeing some more testing, both for correctness and performance.


cheers

andrew