On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
> Actually, that gets optimized to a constant in the planner:
Oops, thank you (and Tom).
> your patch seems to be about 2x-3x as slow as unpatched master. So this
> needs some optimization. A couple of ideas:
I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?
> 1. If the escape string is in fact a single-byte character, you can
> proceed with the loop just as it is today, without the pg_mblen calls.
>
> 2. Since pg_mblen() will always return an integer between 1-6, it would
> probably be faster to replace the memcpy() and memcmp() calls with
> simple for-loops iterating byte-by-byte.
>
> In very brief testing, with the 1. change above, the performance with
> this patch is back to what it's without the patch. See attached.
The particular patch has a mistake: the first branch is always taken
because pg_mblen() won't return 0. It's also fairly ugly to set mblen in
the test for the branch that doesn't use it.
Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.
I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.
Regards,
Jeff Davis
*** a/src/backend/utils/adt/regexp.c
--- b/src/backend/utils/adt/regexp.c
***************
*** 688,698 **** similar_escape(PG_FUNCTION_ARGS)
elen = VARSIZE_ANY_EXHDR(esc_text);
if (elen == 0)
e = NULL; /* no escape character */
! else if (elen != 1)
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! errmsg("invalid escape string"),
! errhint("Escape string must be empty or one character.")));
}
/*----------
--- 688,703 ----
elen = VARSIZE_ANY_EXHDR(esc_text);
if (elen == 0)
e = NULL; /* no escape character */
! else
! {
! int escape_mblen = pg_mbstrlen_with_len(e, elen);
!
! if (escape_mblen > 1)
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! errmsg("invalid escape string"),
! errhint("Escape string must be empty or one character.")));
! }
}
/*----------
***************
*** 724,729 **** similar_escape(PG_FUNCTION_ARGS)
--- 729,782 ----
{
char pchar = *p;
+ /*
+ * If both the escape character and the current character from the
+ * pattern are multi-byte, we need to take the slow path.
+ *
+ * But if one of them is single-byte, we can process the the pattern
+ * one byte at a time, ignoring multi-byte characters. (This works
+ * because all server-encodings have the property that a valid
+ * multi-byte character representation cannot contain the
+ * representation of a valid single-byte character.)
+ */
+
+ if (elen > 1)
+ {
+ int mblen = pg_mblen(p);
+ if (mblen > 1)
+ {
+ /* slow, multi-byte path */
+ if (afterescape)
+ {
+ *r++ = '\\';
+ memcpy(r, p, mblen);
+ r += mblen;
+ afterescape = false;
+ }
+ else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
+ {
+ /* SQL99 escape character; do not send to output */
+ afterescape = true;
+ }
+ else
+ {
+ /*
+ * We know it's a multi-byte character, so we don't need
+ * to do all the comparisons to single-byte characters
+ * that we do below.
+ */
+ memcpy(r, p, mblen);
+ r += mblen;
+ }
+
+ p += mblen;
+ plen -= mblen;
+
+ continue;
+ }
+ }
+
+ /* fast path */
if (afterescape)
{
if (pchar == '"' && !incharclass) /* for SUBSTRING patterns */
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers