Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Alexey Kondratov Mon, 27 Mar 2017 15:55:37 -0700

Pavel, Craig and Stas,

Thank you for your responses and valuable comments!


I have written draft proposal 
https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
 
<https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit>

It seems that COPY currently is able to return first error line and error type 
(extra or missing columns, type parse error, etc).
Thus, the approach similar to the Stas wrote should work and, being optimised 
for a small number of error rows, should not 
affect COPY performance in such case.

I will be glad to receive any critical remarks and suggestions.


Alexey


> On 23 Mar 2017, at 17:24, Stas Kelvich <stas.kelv...@gmail.com> wrote:
> 
>> 
>> On 23 Mar 2017, at 15:53, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> 
>> On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.alek...@gmail.com> 
>> wrote:
>> 
>>> (1) Add errors handling to COPY as a minimum program
>> 
>> Huge +1 if you can do it in an efficient way.
>> 
>> I think the main barrier to doing so is that the naïve approach
>> creates a subtransaction for every row, which is pretty dire in
>> performance terms and burns transaction IDs very rapidly.
>> 
>> Most of our datatype I/O functions, etc, have no facility for being
>> invoked in a mode where they fail nicely and clean up after
>> themselves. We rely on unwinding the subtransaction's memory context
>> for error handling, for releasing any LWLocks that were taken, etc.
>> There's no try_timestamptz_in function or anything, just
>> timestamptz_in, and it ERROR's if it doesn't like its input. You
>> cannot safely PG_TRY / PG_CATCH such an exception and continue
>> processing to, say, write another row.
>> 
>> Currently we also don't have a way to differentiate between
>> 
>> * "this row is structurally invalid" (wrong number of columns, etc)
>> * "this row is structually valid but has fields we could not parse
>> into their data types"
>> * "this row looks structurally valid and has data types we could
>> parse, but does not satisfy a constraint on the destination table"
>> 
>> Nor do we have a way to write to any kind of failure-log table in the
>> database, since a simple approach relies on aborting subtransactions
>> to clean up failed inserts so it can't write anything for failed rows.
>> Not without starting a 2nd subxact to record the failure, anyway.
> 
> If we are optimising COPY for case with small amount of bad rows
> than 2nd subtransaction seems as not a bad idea. We can try to
> apply batch in subtx, if it fails on some row N then insert rows [1, N)
> in next subtx, report an error, commit subtx. Row N+1 can be treated
> as beginning of next batch.
> 
> 
> But if there will be some problems with handling everything with
> subtransaction and since parallelism is anyway mentioned, what about
> starting bgworker that will perform data insertion and will be controlled
> by backend?
> 
> For example backend can do following:
> 
> * Start bgworker (or even parallel worker) 
> * Get chunk of rows out of the file and try to apply them in batch
> as subtransaction in bgworker.
> * If it fails than we can open transaction in backend itself and
> raise notice / move failed rows to special errors table.
> 
>> So, having said why it's hard, I don't really have much for you in
>> terms of suggestions for ways forward. User-defined data types,
>> user-defined constraints and triggers, etc mean anything involving
>> significant interface changes will be a hard sell, especially in
>> something pretty performance-sensitive like COPY.
>> 
>> I guess it'd be worth setting out your goals first. Do you want to
>> handle all the kinds of problems above? Malformed  rows, rows with
>> malformed field values, and rows that fail to satisfy a constraint? or
>> just some subset?
>> 
>> 
>> 
>> -- 
>> Craig Ringer                   http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Training & Services
>> 
>> 
>> -- 
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org 
>> <mailto:pgsql-hackers@postgresql.org>)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers 
>> <http://www.postgresql.org/mailpref/pgsql-hackers>

Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Reply via email to