Thank you, Kevin -- this is helpful.

But it still leaves questions for me.

Kevin Grittner <kgri...@ymail.com> wrote:

> Alex Goncharov <alex.goncharov....@gmail.com> wrote:

> > The whole thing is aborted then, and the good 99 records are not
> > making it into the target table.
>
> Right.  This is one reason people often batch such copies or check
> the data very closely before copying in.

How do I decide, before starting a COPY data load, whether such load
protection (the batching "complexity") makes sense ("is necessary")?

It is clearly not needed for loading 1 MB of data in a realistic environment.

It clearly is needed for loading 1 TB in a realistic environment.

To put it differently: If I COPY 1 TB of data, what criteria should I
use for choosing the size of the chunks to split the data into?

For INSERT loading, with database client interfaces that offer an array
mode, the performance difference between loading 100 and 1,000 rows at
a time is usually negligible, if there is any.  So both 100-row and
1,000-row array sizes are reasonable choices.
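
Just to be concrete about what I mean by array-mode loading, here is a
psycopg2 sketch; the connection string, table and column names are
invented for illustration:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test")                 # made-up connection string
rows = [(i, "value-%d" % i) for i in range(100000)]    # stand-in data

with conn, conn.cursor() as cur:
    # page_size is the "array size" I am talking about; 100 vs. 1000
    # rarely makes a measurable difference in my experience.
    execute_values(cur,
                   "INSERT INTO target_table (id, payload) VALUES %s",
                   rows, page_size=1000)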

But what is a reasonable size for a COPY chunk?  It can't even be
measured in rows.

Note that if you have a 1 TB record-formatted file to load, you can't
just split it into 1 MB chunks and feed them to COPY -- the file has to
be split on record boundaries.
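
The best I can come up with is line-by-line chunking along these lines
(a psycopg2 sketch; the file name, table name and chunk size are
invented) -- and even this only works when the format guarantees one
record per line, with no quoted newlines as CSV allows:

import io
import psycopg2

CHUNK_ROWS = 1000000    # and what is the right number here?  That is my question.

conn = psycopg2.connect("dbname=test")        # made-up connection string
cur = conn.cursor()

with open("/data/huge_feed.tsv") as feed:     # made-up file name
    batch = []
    for line in feed:                         # going line by line keeps records whole
        batch.append(line)
        if len(batch) >= CHUNK_ROWS:
            cur.copy_expert("COPY target_table FROM STDIN",
                            io.StringIO("".join(batch)))
            conn.commit()                     # each chunk is its own transaction
            batch = []
    if batch:
        cur.copy_expert("COPY target_table FROM STDIN",
                        io.StringIO("".join(batch)))
        conn.commit()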

So, splitting the data for COPY is not a trivial operation, and if
such splitting can be avoided, a reasonable operator will avoid it.

But then again: when can it be avoided?

> > My question is: Where have these 99 records been living, on
> > the database server, while the 100-th one hasn't come yet, and
> > the need to throw the previous data accumulation away has not
> > come yet?
>
> They will have been written into the table.  They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits.  These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf

Yeah, I know about the MVCC model...  My question is about the
potentially huge storage that has to be held, uncommitted, while the
load is still in progress -- and about the size constraints in effect
there.

> > There have to be some limits to the space and/or counts taken by
> > the new, uncommitted, data, while the COPY operation is still in
> > progress.  What are they?
>
> Primarily disk space for the table.

How can that be determined?  Is "df /mount/point" the deciding factor,
or some 2^32 or 2^64 limit?
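
Is the pre-flight check supposed to look roughly like this?  (A sketch;
the data directory path, the feed size, and especially the fudge factor
are pure guesses on my part.)

import shutil

feed_bytes = 10**12        # roughly what I am about to COPY
fudge = 1.5                # tuple headers, padding, indexes, WAL, ...?

usage = shutil.disk_usage("/var/lib/postgresql/data")   # the tablespace's mount point
if usage.free < feed_bytes * fudge:
    raise SystemExit("probably not enough room for this COPY")

And if it is, what should the fudge factor actually be?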

> If you are not taking advantage of the "unlogged load" optimization,
> you will have written Write Ahead Log (WAL) records, too -- which
> (depending on your configuration) you may be archiving.  In that
> case, you may need to be concerned about the archive space required.

"... may need to be concerned ..." if what?  Loading 1 MB? 1 GB? 1 TB?

If I am always concerned, and check something before a COPY, what
should I be checking?  What are the "OK-to-proceed" criteria?
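
The only check I can think of automating is whether archiving is on at
all (a psycopg2 sketch; the connection string is made up):

import psycopg2

conn = psycopg2.connect("dbname=test")    # made-up connection string
cur = conn.cursor()
cur.execute("""SELECT name, setting
                 FROM pg_settings
                WHERE name IN ('wal_level', 'archive_mode', 'archive_command')""")
for name, setting in cur.fetchall():
    print(name, "=", setting)

But even knowing that archive_mode is on doesn't tell me how much
archive space a given COPY will need -- roughly the volume of the load
over again, as I understand it?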

> If you have foreign keys defined for the table, you may get into
> trouble on the RAM used to track pending checks for those
> constraints.  I would recommend adding any FKs after you are done
> with the big bulk load.

I am curious about the simplest case, where only data storage needs to
be worried about. (As an aside: CHECK and NOT NULL constraints are not
a storage factor, right?)
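
For what it's worth, I read your FK advice as amounting to something
like this (all names invented):

import psycopg2

conn = psycopg2.connect("dbname=test")    # made-up connection string
cur = conn.cursor()

cur.execute("ALTER TABLE target_table DROP CONSTRAINT IF EXISTS target_parent_fk")
conn.commit()

# ... the big COPY runs here ...

cur.execute("""ALTER TABLE target_table
                 ADD CONSTRAINT target_parent_fk
                 FOREIGN KEY (parent_id) REFERENCES parent_table (id)""")
conn.commit()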

> PostgreSQL does *not* have a "rollback log" which will impose a
> limit.

Something will impose a limit, though, right?  What would that be --
the available disk space on the file system?  (I would be surprised.)

> > Say, I am COPYing 100 TB of data and the bad records are close
> > to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.

I see.  How much data can I fit there while doing COPY?  Not 1 TB?
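
From a second session I suppose I can at least watch the invisible rows
accumulate on disk while the COPY runs -- a sketch, table name invented:

import psycopg2

conn = psycopg2.connect("dbname=test")    # made-up connection string
cur = conn.cursor()
cur.execute("SELECT pg_size_pretty(pg_total_relation_size('target_table'))")
print("target_table on disk:", cur.fetchone()[0])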

-- Alex



On Tue, Aug 26, 2014 at 6:33 PM, Kevin Grittner <kgri...@ymail.com> wrote:

> Alex Goncharov <alex.goncharov....@gmail.com> wrote:
>
> > Suppose I COPY a huge amount of data, e.g. 100 records.
> >
> > My 99 records are fine for the target, and the 100-th is not --
> > it comes with a wrong record format or a target constraint
> > violation.
> >
> > The whole thing is aborted then, and the good 99 records are not
> > making it into the target table.
>
> Right.  This is one reason people often batch such copies or check
> the data very closely before copying in.
>
> > My question is: Where have these 99 records been living, on
> > the database server, while the 100-th one hasn't come yet, and
> > the need to throw the previous data accumulation away has not
> > come yet?
>
> They will have been written into the table.  They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits.  These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf
>
> > There have to be some limits to the space and/or counts taken by
> > the new, uncommitted, data, while the COPY operation is still in
> > progress.  What are they?
>
> Primarily disk space for the table.  If you are not taking
> advantage of the "unlogged load" optimization, you will have
> written Write Ahead Log (WAL) records, too -- which (depending on
> your configuration) you may be archiving.  In that case, you may
> need to be concerned about the archive space required.  If you have
> foreign keys defined for the table, you may get into trouble on the
> RAM used to track pending checks for those constraints.  I would
> recommend adding any FKs after you are done with the big bulk load.
>
> PostgreSQL does *not* have a "rollback log" which will impose a limit.
>
> > Say, I am COPYing 100 TB of data and the bad records are close
> > to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.  Autovacuum will clean them out in the background, but
> if you want to restart your load against an empty table it might be
> a good idea to TRUNCATE that table; it will be a lot faster.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
