Andrew Dunstan wrote:
Heikki Linnakangas wrote:
It looks like strpbrk() performs poorly:

Yes, not surprising. I just looked at the implementation in glibc, which I assume you are using, and it seemed rather basic. The one in NetBSD's libc looks much more efficient.

See

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc

There's architecture-specific optimized versions of that in glibc, though. For example, for x86_64:

http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

Not that what you've done isn't good, if a little obscure (as is the NetBSD implementation)

Yeah, it's a lot of code for such a simple thing :-(. On the other hand, it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with a few more observations and ideas:

1. CopyReadLineText is all about finding the the next end of line; splitting to fields is done later. We therefore only care about quotes and escapes when they affect the end of line detection. In text mode, we only need to care about a backslash that precedes a LF/CR. Therefore, we could search for the next LF/CR character with memchr(), and check if the preceding character is a backslash (and if it is, check if there's yet another backslash behind it, and so forth until we hit a non-backslash character).

2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggests we could, we wouldn't need to care about backslashes in CopyReadLineText in text mode at all.

3. If we did the client->server encoding conversion before CopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when the client encoding is one not supported as a server encoding. That would be a big performance gain when that's the case, and would simplify the code in CopyReadLineText a little bit as well.

- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, and split the input block into fields in one loop. I would imagine that being more efficient than scanning the input twice, no matter how fast we make those scans.

There's a small difficulty with doing the conversion one input block at a time: there's no suitable interface in the conversion routines for that. The input block boundary can be between 1st and 2nd byte of a multi-byte character, so we'd need to have a function that converts a block of bytes, ignoring any incomplete bytes at the end.

I think I'll experiment with the 1st idea, to see how much code restructuring it needs and what the performance is like.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Reply via email to