Re: [PATCHES] CopyReadLineText optimization

Heikki Linnakangas Tue, 11 Mar 2008 03:26:51 -0700

Andrew Dunstan wrote:

Heikki Linnakangas wrote:
It looks like strpbrk() performs poorly:
Yes, not surprising. I just looked at the implementation in glibc, whichI assume you are using, and it seemed rather basic. The one in NetBSD'slibc looks much more efficient.
See
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc

There's architecture-specific optimized versions of that in glibc,though. For example, for x86_64:


http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

Not that what you've done isn't good, if a little obscure (as is theNetBSD implementation)

Yeah, it's a lot of code for such a simple thing :-(. On the other hand,it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with afew more observations and ideas:

1. CopyReadLineText is all about finding the the next end of line;splitting to fields is done later. We therefore only care about quotesand escapes when they affect the end of line detection. In text mode, weonly need to care about a backslash that precedes a LF/CR. Therefore,we could search for the next LF/CR character with memchr(), and check ifthe preceding character is a backslash (and if it is, check if there'syet another backslash behind it, and so forth until we hit anon-backslash character).


2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY dataconvert data newlines and carriage returns to the \n and \r sequencesrespectively. At present it is possible to represent a data carriagereturn by a backslash and carriage return, and to represent a datanewline by a backslash and newline. However, these representations mightnot be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggestswe could, we wouldn't need to care about backslashes in CopyReadLineTextin text mode at all.

3. If we did the client->server encoding conversion beforeCopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when theclient encoding is one not supported as a server encoding. That would bea big performance gain when that's the case, and would simplify the codein CopyReadLineText a little bit as well.


- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, andsplit the input block into fields in one loop. I would imagine thatbeing more efficient than scanning the input twice, no matter how fastwe make those scans.

There's a small difficulty with doing the conversion one input block ata time: there's no suitable interface in the conversion routines forthat. The input block boundary can be between 1st and 2nd byte of amulti-byte character, so we'd need to have a function that converts ablock of bytes, ignoring any incomplete bytes at the end.

I think I'll experiment with the 1st idea, to see how much coderestructuring it needs and what the performance is like.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Re: [PATCHES] CopyReadLineText optimization

Reply via email to