Re: Conflict handling for COPY FROM

David G. Johnston Wed, 08 Apr 2020 09:43:22 -0700

On Fri, Mar 27, 2020 at 3:27 PM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Surafel Temesgen <surafel3...@gmail.com> writes:
> > [ conflict-handling-copy-from-v16.patch ]
>
> I took a quick look at this patch, since it was marked "ready for
> committer", but I don't see how it can possibly be considered committable.
>


[...]


>
> Looking at this patch in terms of whether the functionality is
> available in that approach, it seems like you might want two parts
> of it:
>
> 1. A COPY option to be flexible about the number of columns in the
> input, say by filling omitted columns with NULLs.
>
> 2. INSERT ... ON CONFLICT can be used to transfer data to permanent
> tables with rejection of duplicate keys, but it doesn't have much
> flexibility about just what to do with duplicates.  Maybe we could
> add more ON CONFLICT actions?  Sending conflicted rows to some other
> table, or updating the source table to show which rows were copied
> and which not, might be useful things to think about.
>

tl/dr: patch 1: change the option to something like "processing_mode = row
{default = file}" and relay existing errors across all rows in the table in
the error detail message.

tl/dr patch 2: add an "IGNORE_CONFLICTS = {-1, 0 default, 1+}" option that
converts the errors that appear for speculative insertions into warnings
(in the error detail message) and either aborts the copy (cleanly, no
inserted rows) if the count exceeds the user specified value) or commits if
it is able (i.e., no errors in the error message detail - which would come
from other problems besides conflicts).  I don't really get why a number is
needed here though - its not like ON CONFLICT DO NOTHING has that ability
and all this seems to be wanting to do is enable a subset of ON CONFLICT DO
NOTHING for COPY - in which case no warning or error would appear in the
first place.

Sorry for the rambling below, a decent chuck of the material is stream of
consciousness but it all generally relates to the behaviors that this patch
is touching.

My desired take-away from this would be to have COPY's error response
include one line for each input line that fails to be inserted with the
line number and the underlying error message passed along verbatim.
Importing, fixing one error, trying again, fixing another error, repeat is
a first level goal for me.  Including at most the first 30 characters of
the problematic line would facilitate the common case where the first field
is an identifier which is more useful that a pure line number.  But
including the entire line doesn't seem worth the risk.  Though it should be
considered whether to include the error detail of the underlying error
message - probably as a subsequent line.

After that, consider having a facility for telling COPY that you are OK if
it ignores the problem rows and inserts the ones it is able - and
converting the final exit code to be a warning instead of an error
(whatever that means, if anything, in this context).

The speculative insertion stuff is outside my experience but are we trying
to just make it work, give the COPY user control of whether its used or
not, have an independent facility just for copy, or something else?
Teaching copy to use speculative insertion infrastructure that already
exists is a patch in its own right and seems to be fixing a current POLA
violation - this other stuff about reporting and ignoring errors is besides
the point.  The basic inter-operability fix seems like it could be made to
work under today's copy error handling and report framework just fine.

If there is value in someone augmenting the error handling and response
framework to work in a tabular format instead of a free-form error message
text body that too could be presented and discussed as its own commit.
Granted it may be the case that such a verbose error message from COPY as
described above is undesirable but would be palatable in some other
presentation format.  I'm inclined to believe, however, that even if COPY
is made an exception to the general error message rules (and the detail
information would be part of the errdetail, not the main message, it could
just include a summary count) that the added functionality would make the
exception acceptable.

To be honest I haven't dived into the assumptions baked into the regression
tests to know whether the tabular stuff that was discussed and that is
shown in the regression expected outputs is normal or not.  I assume the
part in question is:

+COPY x from stdin WITH(ERROR_LIMIT 5);
+WARNING:  skipping "70001 22 32" --- missing data for column "d"
+WARNING:  skipping "70002 23 33 43 53 54" --- extra data after last
expected column
+WARNING:  skipping "70003 24 34 44" --- missing data for column "e"
+
+   a   | b  | c  | d  |          e
+-------+----+----+----+----------------------
+ 70005 | 27 | 37 | 47 | before trigger fired
+(1 row)

Right now COPY would just fail.  What would be nice is for it to say:

ERROR: Line 1?  missing data for column "d"
ERROR: Line 2?  --- extra data after last expected column
ERROR: Line 3?  --- missing data for column "e"

Making process go as far along as possible, to get better (more specific)
messages from deeper in the system, but still just saying "ERROR: Line #
failed"

Then,

WARNING: ...
WARNING: ...
WARNING: ...

COPY 1

But I'm not even sure why the tabular representation is showing up in that
stanza...

>From the OP:

"If you assume that users re-execute COPY FROM with the output lines as
input, these columns are obstacles."

I really don't care right now to make that assumption nor is facilitating
it necessarily a goal.  While this may be the current motivation there are
other parts that are useful in their own right whether or not we proceed
toward this final outcome.  As Tom said its unclear whether its realistic
to make this work directly and there is no documentation that describes how
this patch would support that use case.   I haven't delved that deeply into
this potential aspect of the patch but it is something worth of its own
patch on a thread with an appropriate title.

Given the current thread's title (conflict handling - not clear on whether
"make it work" or "response alterations when it doesn't" is the goal) all
this should be doing is making it so errors don't happen when all rows
would be inserted correctly if existing speculative insertion routines were
being used for the exact same data coming from a table and moved via INSERT
(ON CONFLICT DO NOTHING).  That implies that any issues in populating said
equivalent table are out of scope for this patch - as is changing the error
handling mechanics.  I don't know if, with those pieces cut out, whether
the core of the patch with speculative insertions is ready to commit.

Looking only at the patch it seems as if speculative insertion is only
attempted if we are willing to ignore errors.  Thus ignoring errors is an
indirect way of saying that you want to bypass the bulk insertion
performance trade-offs and once you do that you might as well check for
conflicts and let them resolve.  It is desirable, though, to be able to
turn off those performance trade-offs simply in order to attempt to insert
every single row and be informed of all the errors in the input in one
pass.  Future flags like ERROR_LIMIT would then simply imply that state.

With the two tl/dr patches in place features like making actions besides
"DO NOTHING" (which is what "ignore" basically does but you can choose a
maximum count to accept a do nothing outcome) can be considered.  Also,
adding an option to "IGNORE_PARSING_ERRORS" or IGNORE_STRUCTURE_ERRORS" (or
creating an enum and a multi-valued option...).  I don't have a problem
with limited scope here but the naming should be chosen to match.  You name
something "ERRORS" and it should cover everything.  In addition, being more
precise, besides aiding development scope, gives the user the ability to
choose which kinds of errors that are willing to ignore and which they
really want to be fatal.

David J.

Re: Conflict handling for COPY FROM

Reply via email to