Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Jeff Davis Tue, 26 Aug 2014 22:15:07 -0700

On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
> Actually, that gets optimized to a constant in the planner:


Oops, thank you (and Tom).

> your patch seems to be about 2x-3x as slow as unpatched master. So this 
> needs some optimization. A couple of ideas:

I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?

> 1. If the escape string is in fact a single-byte character, you can 
> proceed with the loop just as it is today, without the pg_mblen calls.
> 
> 2. Since pg_mblen() will always return an integer between 1-6, it would 
> probably be faster to replace the memcpy() and memcmp() calls with 
> simple for-loops iterating byte-by-byte.
> 
> In very brief testing, with the 1. change above, the performance with 
> this patch is back to what it's without the patch. See attached.

The particular patch has a mistake: the first branch is always taken
because pg_mblen() won't return 0. It's also fairly ugly to set mblen in
the test for the branch that doesn't use it.

Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.

I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.

Regards,
        Jeff Davis

*** a/src/backend/utils/adt/regexp.c
--- b/src/backend/utils/adt/regexp.c
***************
*** 688,698 **** similar_escape(PG_FUNCTION_ARGS)
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else if (elen != 1)
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 					 errmsg("invalid escape string"),
! 				  errhint("Escape string must be empty or one character.")));
  	}
  
  	/*----------
--- 688,703 ----
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else
! 		{
! 			int			escape_mblen = pg_mbstrlen_with_len(e, elen);
! 
! 			if (escape_mblen > 1)
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 						 errmsg("invalid escape string"),
! 						 errhint("Escape string must be empty or one character.")));
! 		}
  	}
  
  	/*----------
***************
*** 724,729 **** similar_escape(PG_FUNCTION_ARGS)
--- 729,782 ----
  	{
  		char		pchar = *p;
  
+ 		/*
+ 		 * If both the escape character and the current character from the
+ 		 * pattern are multi-byte, we need to take the slow path.
+ 		 *
+ 		 * But if one of them is single-byte, we can process the the pattern
+ 		 * one byte at a time, ignoring multi-byte characters.  (This works
+ 		 * because all server-encodings have the property that a valid
+ 		 * multi-byte character representation cannot contain the
+ 		 * representation of a valid single-byte character.)
+ 		 */
+ 
+ 		if (elen > 1)
+ 		{
+ 			int mblen = pg_mblen(p);
+ 			if (mblen > 1)
+ 			{
+ 				/* slow, multi-byte path */
+ 				if (afterescape)
+ 				{
+ 					*r++ = '\\';
+ 					memcpy(r, p, mblen);
+ 					r += mblen;
+ 					afterescape = false;
+ 				}
+ 				else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
+ 				{
+ 					/* SQL99 escape character; do not send to output */
+ 					afterescape = true;
+ 				}
+ 				else
+ 				{
+ 					/*
+ 					 * We know it's a multi-byte character, so we don't need
+ 					 * to do all the comparisons to single-byte characters
+ 					 * that we do below.
+ 					 */
+ 					memcpy(r, p, mblen);
+ 					r += mblen;
+ 				}
+ 
+ 				p += mblen;
+ 				plen -= mblen;
+ 
+ 				continue;
+ 			}
+ 		}
+ 
+ 		/* fast path */
  		if (afterescape)
  		{
  			if (pchar == '"' && !incharclass)	/* for SUBSTRING patterns */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Reply via email to