Andrew Dunstan wrote:
Heikki Linnakangas wrote:
It looks like strpbrk() performs poorly:
Yes, not surprising. I just looked at the implementation in glibc, which
I assume you are using, and it seemed rather basic. The one in NetBSD's
libc looks much more efficient.
There's architecture-specific optimized versions of that in glibc,
though. For example, for x86_64:
Not that what you've done isn't good, if a little obscure (as is the
Yeah, it's a lot of code for such a simple thing :-(. On the other hand,
it's very isolated, so it's not that bad.
We chatted about this with Greg Stark yesterday, and we came up with a
few more observations and ideas:
1. CopyReadLineText is all about finding the the next end of line;
splitting to fields is done later. We therefore only care about quotes
and escapes when they affect the end of line detection. In text mode, we
only need to care about a backslash that precedes a LF/CR. Therefore,
we could search for the next LF/CR character with memchr(), and check if
the preceding character is a backslash (and if it is, check if there's
yet another backslash behind it, and so forth until we hit a
2. The manual has this to say about escaping newlines with a backslash:
"It is strongly recommended that applications generating COPY data
convert data newlines and carriage returns to the \n and \r sequences
respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might
not be accepted in future releases."
If we disallowed the backslash+LF sequence, like that paragraph suggests
we could, we wouldn't need to care about backslashes in CopyReadLineText
in text mode at all.
3. If we did the client->server encoding conversion before
CopyReadLineText, one input block at a time:
- We could skip calling pg_encoding_mblen for each character, when the
client encoding is one not supported as a server encoding. That would be
a big performance gain when that's the case, and would simplify the code
in CopyReadLineText a little bit as well.
- The conversion might be faster, as it could operate on bigger blocks
- We could merge the CopyReadLine and CopyReadAttributes functions, and
split the input block into fields in one loop. I would imagine that
being more efficient than scanning the input twice, no matter how fast
we make those scans.
There's a small difficulty with doing the conversion one input block at
a time: there's no suitable interface in the conversion routines for
that. The input block boundary can be between 1st and 2nd byte of a
multi-byte character, so we'd need to have a function that converts a
block of bytes, ignoring any incomplete bytes at the end.
I think I'll experiment with the 1st idea, to see how much code
restructuring it needs and what the performance is like.
Sent via pgsql-patches mailing list (email@example.com)
To make changes to your subscription: