Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Peter Geoghegan Thu, 08 Jun 2017 15:10:57 -0700

On Wed, Jun 7, 2017 at 12:34 PM, Alex K <kondratov.alek...@gmail.com> wrote:
> (1) One of my mentors--Alvaro Herrera--suggested me to have a look on the
> UPSERT.


> It may be a good point to be able to achieve the same functionality
> as during the ON CONFLICT DO NOTHING, when COPY actually inserts tuples
> and errors handling is turned on. It could additionally reduce number of
> failed
> subtransactions and reduce XIDs consumption, while still ignoring some
> common
> errors like unique index violation.

Alvaro and I talked about this informally at PGCon.

> Adding a full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
> to be a large separated task and is out of the current project scope, but
> maybe there is
> a relatively simple way to somehow perform internally tuples insert with
> ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
> I understand he is the major contributor of UPSERT in PostgreSQL. It would
> be great
> if he will answer this question.

I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.

You seem to be talking about doing this as an optimization on top of a
base feature that does the main thing you want (captures all errors
within an implementation level subxact without passing them to the
client). That could make sense, as a way of preventing extreme bloat
for a very bad case where almost all inserts have conflicts. (This
seems quite possible, whereas it seems much less likely that users
would have an input file simple full of illformed tuples.)

I think that you need to more formally identify what errors your new
COPY error handling will need to swallow. I'm not sure if it's
possible to avoid using subtransactions all together, but speculative
insertion would help if you find that you can do it without
subtransactions. Using subtransactions is always going to be a bit
ugly, because you'll need to continually reassess whether or not
you're batching insertions together at the right granularity (that is,
that you've weighed the rate of XID consumption against how much work
you lose when a batched transaction has to be "replayed" to include
things that are known to be valid). And, if you care about duplicate
violations, then you can't really be sure that replaying a "known
good" tuple will stay good from one moment to the next.

My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types. Is there really much value in
ignoring errors due to invalid encoding? It's not as if such problems
can be reliably detected today. If you use the wrong encoding, and
ignore some errors that COPY would generally raise, then there is an
excellent chance that you'll still insert some remaining rows with
text that has been incorrectly interpreted as valid in the database
encoding -- some text datums are bound to accidentally appear valid.
There are probably similar issues with other types. It's not clear
what the point is at which the user is no longer helped by ignoring
problems, because we cannot reliably detect *all* problems at the
level of each row.

If you must ignore errors within the input functions of types, then
maybe you can optionally let the user do that by way of a "dry run",
where the entire input file is examined for basic structural soundness
ahead of considering constraints. Any errors are saved then and there,
in a format that can be used to make sure that those entries are
skipped on a later COPY. As a further enhancement, in the future, the
user might then be able to define special transform functions that
correct the errors for those rows only. You kind of need to see all
the faulty rows together to do something like that, so a dry run could
make a lot of sense.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Reply via email to