Re: [HACKERS] COPY enhancements

Emmanuel Cecchet Thu, 10 Sep 2009 14:32:39 -0700

Josh,

See the answers inlined.

Thank you for tackling this very long-time TODO.

Error logging is described here:
http://wiki.postgresql.org/wiki/Error_logging_in_COPY


Questions & Comments:

A) Why would someone want to turn error_logging on, but leave
error_logging_skip_tuples off?  The pg_log already logs errors which copy 
throws by default.

When error_logging is on and skip_tuples is off, errors are logged inthe error table. If skip_tuples is on, tuples are not logged in theerror table.

B) As I mentioned earlier, we'll want to provide the option of logging
to a file instead of to a table.  That's not a reason to reject this
patch, but probably a TODO for 8.5.

Ok but what should be the format of that file?

C) Are we sure we want to handle this via GUCs rather than extensions to
COPY syntax?  It seems like fairly often users would want to log
different COPY sources to different tables/files.

I agree that new COPY options could be easier to use, the implementationis just more complex. However, the labels allows you to select thetuples related to specific COPY commands.

D) These GUCs are userset, I hope?  (haven't dug into the code far
enough to tell yet).

Yes.

E) What is error_logging_tuple_label for?   You don't explain/give
examples.  And how is error_logging_tuple_partition_key used?

We use the label and partition key in Aster products to easily retrievewhich COPY command on which partition did generate the bad tuples. Bydefault, the tuple_label contains the COPY command that was executed(see example on Wiki) and the key contains the index of the tuple in thesource file (see example on Wiki).

F) Rawdata for rejected tuples is presumably BYTEA?

Yes. I forgot to put back the table description that can be seen in theunit tests. I have updated the Wiki with the table definition.

G) We should probably have a default for error_logging_table_name, such
as pg_copy_errors.  Does that table get automatically created if it
doesn't exist?

Yes, as indicated on the wiki the table is created automatically (seeconfig variable section).

H) Finally, one request of the TODO is some way to halt import after a
specified number of bad tuples because it probably means you have the
wrong file or wrong table.  Do we still want that?

We can still do that. It can be another GUC variable or an option toCOPY. If the COPY command fails, everything gets rolled back (data inthe destination table and error table). That would be harder toimplement with a file (the rollback part).

Autopartitioning is described here:
http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY


M) tuple_routing_in_copy should take "on" or "off", not 0 or 1.

Ok.

N) Have you measured the overhead & speed of this kind of COPY as
opposed to COPY into a single table?  Have you checked the overhead if
tuple_routing_in_copy is on, but you are not loading into a partitioned
table?

Yes. There is no noticeable overhead if there is no routing to do (butrouting is on).If routing is involved, the overhead depends on how sorted your inputdata is. If it all goes to the same partition, the caching effect workswell and there is no noticeable overhead. The cost is in the constraintcheck and it depends on the complexity of the constraint. The moreconstraints you have to check and the more complex they are, the moreoverhead on each tuple routing.

O) Is this capable of dealing with partitioning by more than one column,
or by an expression?

Yes, we just use a brute force technique where we try all child tables1-by-1 and rely on the existing Postgres constraint checking mechanism(no new or duplicated code there).

Finally, I'm going to suggest different names for the GUCs, as the names
you've chosen don't group well and would likely cause confusion.  Here
are my suggestions, which all begin with "copy_" for prefix matching:

error_logging --> probaby not needed, see able
error_logging_skip_tuples --> copy_skip_bad_rows
error_logging_schema_name --> copy_logging_schema_name
error_logging_relation_name --> copy_logging_table_name
error_logging_tuple_label --> don't know what this is for, see above
error_logging_tuple_partition_key --> don't know what this is for, see above

tuple_routing_in_copy --> copy_partitioning
tuple_routing_cache_size --> copy_partitioning_cache_size

This makes sense. I'll add that on my todo list.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] COPY enhancements

Reply via email to