Tom Lane wrote:
I wrote:
Actually, I have to take back that objection: on closer look, COPY
validates the data only once and does so before applying its own
backslash-escaping rules.  So there is a risk in that path too.

It's still pretty annoying to be validating the data twice in the
common case where no backslash reduction occurred, but I'm not sure
I see any good way to avoid it.

Further thought here: if we put encoding verification into textin()
and related functions, could we *remove* it from COPY IN, in the common
case where client and server encodings are the same?  Currently, copy.c
forces a trip through pg_client_to_server for multibyte encodings
even when the encodings are the same, so as to perform validation.
But I'm wondering whether we'd still need that.  There's no risk of
SQL injection in COPY data.  Bogus input encoding could possibly
make for confusion about where the field boundaries are, but bad
data is bad data in any case.

                        regards, tom lane
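
To make the risk in the quoted text concrete, here is a toy sketch of why validating before backslash reduction is not enough. It is not copy.c's code: is_valid_utf8() and unescape_octal() are hypothetical helpers written for this illustration, and only the \nnn octal escape form of COPY text format is modelled.

```python
import re


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (stand-in for encoding verification)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def unescape_octal(data: bytes) -> bytes:
    """Toy stand-in for backslash reduction: replace \\nnn octal escapes with the raw byte."""
    return re.sub(
        rb"\\([0-7]{1,3})",
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),
        data,
    )


# The input as it arrives: pure ASCII, so it passes validation.
raw = b"abc\\377def"
valid_before = is_valid_utf8(raw)

# After backslash reduction the field contains the byte 0xFF,
# which can never appear in well-formed UTF-8.
cooked = unescape_octal(raw)
valid_after = is_valid_utf8(cooked)

print(valid_before, valid_after)
```

The point is only that a check done once, ahead of escape processing, says nothing about the bytes that end up stored.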

Here are some timing tests on 1M rows of random UTF-8-encoded 100-character data. It doesn't look to me like the saving you're suggesting is worth the trouble.


Time: 28228.325 ms
Time: 25987.740 ms
Time: 25950.707 ms
Time: 25756.371 ms
Time: 27589.719 ms
Time: 25774.417 ms

After adding the suggested extra test to textin():

Time: 26722.376 ms
Time: 28343.226 ms
Time: 26529.364 ms
Time: 28020.140 ms
Time: 24836.853 ms
Time: 24860.530 ms
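
A quick sketch for comparing the two sets of runs, using only the numbers quoted above. It assumes the first, unlabelled set is the run without the extra test; the variable names are just for illustration.

```python
from statistics import mean

# Timings in milliseconds, copied from the runs above.
# Assumption: the first set is the unpatched baseline.
before_ms = [28228.325, 25987.740, 25950.707, 25756.371, 27589.719, 25774.417]
after_ms = [26722.376, 28343.226, 26529.364, 28020.140, 24836.853, 24860.530]

mean_before = mean(before_ms)
mean_after = mean(after_ms)

# Difference in means, absolute and relative to the baseline.
delta_ms = mean_after - mean_before
delta_pct = 100.0 * delta_ms / mean_before

# Run-to-run spread within each set, for scale.
spread_before = max(before_ms) - min(before_ms)
spread_after = max(after_ms) - min(after_ms)

print(f"mean before: {mean_before:.1f} ms, mean after: {mean_after:.1f} ms")
print(f"delta: {delta_ms:+.1f} ms ({delta_pct:+.3f}%)")
print(f"spread before: {spread_before:.1f} ms, spread after: {spread_after:.1f} ms")
```

With only six runs per set this is a rough comparison, but it shows the difference in means is far smaller than the variation between individual runs.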

Script is:

create table xyz (x text);
copy xyz from '/tmp/';
truncate xyz;
copy xyz from '/tmp/';
truncate xyz;
copy xyz from '/tmp/';
truncate xyz;
copy xyz from '/tmp/';
truncate xyz;
copy xyz from '/tmp/';
truncate xyz;
copy xyz from '/tmp/';
drop table xyz;

Test platform: FC6, Athlon64.



