2014-12-25 22:23 GMT+01:00 Alex Shulgin <a...@commandprompt.com>:

> Trent Shipley <trent_ship...@qwest.net> writes:
>
> > On Friday 2007-12-14 16:22, Tom Lane wrote:
> >> Neil Conway <ne...@samurai.com> writes:
> >> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct
> >> > COPY to drop (and log) rows that contain malformed data.  That is,
> >> > rows with too many or too few columns, rows that result in constraint
> >> > violations, and rows containing columns where the data type's input
> >> > function raises an error.  The last case is the only thing that would
> >> > be a bit tricky to implement, I think: you could use PG_TRY() around
> >> > the InputFunctionCall, but I guess you'd need a subtransaction to
> >> > ensure that you reset your state correctly after catching an error.
> >>
> >> Yeah.  It's the subtransaction per row that's daunting --- not only the
> >> cycles spent for that, but the ensuing limitation to 4G rows imported
> >> per COPY.
> >
> > You could extend the COPY FROM syntax with a COMMIT EVERY n clause.
> > This would help with the 4G subtransaction limit.  The cost to the ETL
> > process is that a simple rollback would not be guaranteed to send the
> > process back to its initial state.  There are easy ways to deal with
> > the rollback issue, though.
> >
> > A {NO} RETRY {USING algorithm} clause might be useful.  If the NO RETRY
> > option is selected then the COPY FROM can run without subtransactions
> > and in excess of the 4G per transaction limit.  NO RETRY should be the
> > default since it preserves the legacy behavior of COPY FROM.
> >
> > You could have an EXCEPTIONS TO {filename|STDERR} clause.  I would not
> > give the option of sending exceptions to a table, since they are
> > presumably malformed, otherwise they would not be exceptions.  (Users
> > should re-process exception files if they want an "if good then table a,
> > else exception to table b ..." flow.)
> >
> > EXCEPTIONS TO and NO RETRY would be mutually exclusive.
> >
> >
> >> If we could somehow only do a subtransaction per failure, things would
> >> be much better, but I don't see how.
>
> Hello,
>
> Attached is a proof of concept patch for this TODO item.  There are no
> docs yet, I just wanted to know if the approach is sane.
>
> The added syntax is like the following:
>
>   COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
>
> The way it's done is by abusing Copy Both mode, and from my limited
> testing that seems to just work.  The error trapping itself is done
> using PG_TRY/PG_CATCH and can only catch formatting or before-insert
> trigger errors; no attempt is made to recover from a failed unique
> constraint, etc.
>
> Example in action:
>
> postgres=# \d test_copy2
>      Table "public.test_copy2"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  id     | integer |
>  val    | integer |
>
> postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
> 1
> NOTICE:  missing data for column "val"
> CONTEXT:  COPY test_copy2, line 1: "1"
> 2
> NOTICE:  missing data for column "val"
> CONTEXT:  COPY test_copy2, line 2: "2"
> 3
> NOTICE:  missing data for column "val"
> CONTEXT:  COPY test_copy2, line 3: "3"
> NOTICE:  total exceptions ignored: 3
>
> postgres=# \d test_copy1
>      Table "public.test_copy1"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  id     | integer | not null
>
> postgres=# set client_min_messages to warning;
> SET
> postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
> ...
> vmstat
> zoneinfo
> postgres=#
>
> Limited performance testing shows no significant difference between
> error-catching and plain code path.  For example, timing
>
>   copy test_copy1 from program 'seq 1000000' [exceptions to stdout]
>
> shows similar numbers with or without the added "exceptions to" clause.
>
> Now that I'm sending this I wonder if the original comment about the
> need for subtransaction around every loaded line still holds.  Any
> example of what would be not properly rolled back by just PG_TRY?
>
This method is unsafe: exception handlers usually don't free memory, so
there is a risk of memory and resource leaks.

You can get about the same performance with block subtransactions - when
you use one subtransaction per 1000 rows, the overhead of subtransactions
is minimal.

When a block fails, you can fall back to row-level subtransactions - that
works well when you expect almost correct data.

Regards

Pavel

>
> Happy hacking!
> --
> Alex
>
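
To make the block-subtransaction idea above concrete, here is a minimal
plpgsql sketch (illustration only, not part of the patch; the function name
load_batch is made up, and it targets the test_copy2(id, val) table from the
example above).  It tries a whole batch of lines inside one EXCEPTION block,
i.e. one subtransaction, and only when that batch fails does it retry the
lines one by one, each in its own subtransaction:

  CREATE OR REPLACE FUNCTION load_batch(lines text[]) RETURNS bigint AS $$
  DECLARE
      loaded bigint := 0;
      l      text;
  BEGIN
      BEGIN
          -- One subtransaction for the whole batch: cheap when the data
          -- is clean.
          INSERT INTO test_copy2(id, val)
          SELECT split_part(x, E'\t', 1)::int,
                 split_part(x, E'\t', 2)::int
          FROM unnest(lines) AS x;
          RETURN array_length(lines, 1);
      EXCEPTION WHEN others THEN
          -- The batch failed: retry row by row, one subtransaction per
          -- row, so only the bad lines are skipped.
          FOREACH l IN ARRAY lines LOOP
              BEGIN
                  INSERT INTO test_copy2(id, val)
                  VALUES (split_part(l, E'\t', 1)::int,
                          split_part(l, E'\t', 2)::int);
                  loaded := loaded + 1;
              EXCEPTION WHEN others THEN
                  RAISE NOTICE 'skipped bad line: %', l;
              END;
          END LOOP;
      END;
      RETURN loaded;
  END;
  $$ LANGUAGE plpgsql;

A loader would split its input into, say, 1000-line batches and call
load_batch once per batch; with almost correct data nearly every batch
commits in a single subtransaction, which is the performance point made
above.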