Thank you, Kevin -- this is helpful. But it still leaves questions for me.
Kevin Grittner <kgri...@ymail.com> wrote:

> Alex Goncharov <alex.goncharov....@gmail.com> wrote:
>
>> The whole thing is aborted then, and the good 99 records are not
>> making it into the target table.
>
> Right.  This is one reason people often batch such copies or check
> the data very closely before copying in.

How do I decide, before starting a COPY data load, whether such load
protection ("complexity") makes sense ("is necessary")?

It is clearly not needed for 1 MB of data in a realistic environment,
and clearly needed for loading 1 TB.

To put it differently: if I COPY 1 TB of data, what criteria should I
use to choose the size of the chunks to split the data into?

For INSERT loading, with database client interfaces that offer an
array mode, the performance difference between loading 100 and 1,000
rows at a time is usually negligible, if measurable at all, so both
are reasonable array sizes.  But what is a reasonable size for a COPY
chunk?  It can't even be measured in rows.

Note that if you have a 1 TB record-formatted file to load, you can't
just split it into 1 MB pieces and feed them to COPY -- the file has
to be split on record boundaries.

So splitting the data for COPY is not a trivial operation, and if such
splitting can be avoided, a reasonable operator will avoid it.  But
then again: when can it be avoided?
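
To make it concrete, below is the kind of chunked-load machinery I
would like to avoid writing.  It is only a sketch: the file, database
and table names are made up, and it assumes COPY text format, where
each record is exactly one physical line (embedded newlines are
escaped as \n), so a line-based split lands on record boundaries; a
CSV file with literal newlines inside quoted fields could not be
split this blindly.

    # Split the dump into 1M-line chunks; in COPY text format,
    # one line == one record, so these splits are record boundaries.
    split --lines=1000000 huge_table.dat chunk_

    # Load each chunk in its own transaction, so a bad record aborts
    # only its own chunk instead of the whole load.
    for f in chunk_*; do
        psql -d mydb -c "\copy big_table from '$f'" || echo "FAILED: $f"
    done

And even this leaves open the question I am asking: how big should a
chunk be?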

>> My question is: Where have these 99 records been living on the
>> database server while the 100th one hasn't come yet, and the need
>> to throw the previous data accumulation away has not come yet?
>
> They will have been written into the table.  They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits.  These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf

Yeah, I know about the MVCC model...  The question is about the huge
amount of storage that has to be consumed, without any commitment,
while the load is still in progress -- about the size constraints in
effect here.

>> There have to be some limits to the space and/or counts taken by
>> the new, uncommitted, data, while the COPY operation is still in
>> progress.  What are they?
>
> Primarily disk space for the table.

How can that be found?  Is "df /mount/point" the deciding factor?  Or
some 2^32 or 2^64 number?

> If you are not taking advantage of the "unlogged load" optimization,
> you will have written Write Ahead Log (WAL) records, too -- which
> (depending on your configuration) you may be archiving.  In that
> case, you may need to be concerned about the archive space required.

"... may need to be concerned ..." if what?  Loading 1 MB?  1 GB?
1 TB?  If I should always be concerned and check something before a
COPY, what should I be checking?  What are the "OK-to-proceed"
criteria?

> If you have foreign keys defined for the table, you may get into
> trouble on the RAM used to track pending checks for those
> constraints.  I would recommend adding any FKs after you are done
> with the big bulk load.

I am curious about the simplest case, where only the data storage is
to be worried about.  (As an aside: the CHECK and NOT NULL constraints
are not a storage factor, right?)

> PostgreSQL does *not* have a "rollback log" which will impose a
> limit.

Something will impose a limit, though, right?  What would that be?
The available disk space on a file system?  (I would be surprised.)

>> Say, I am COPYing 100 TB of data and the bad records are close
>> to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.

I see.  How much data can I fit there while doing COPY?  Not 1 TB?

-- Alex

On Tue, Aug 26, 2014 at 6:33 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> Alex Goncharov <alex.goncharov....@gmail.com> wrote:
>
>> Suppose I COPY a huge amount of data, e.g. 100 records.
>>
>> My 99 records are fine for the target, and the 100th is not --
>> it comes with a wrong record format or a target constraint
>> violation.
>>
>> The whole thing is aborted then, and the good 99 records are not
>> making it into the target table.
>
> Right.  This is one reason people often batch such copies or check
> the data very closely before copying in.
>
>> My question is: Where have these 99 records been living on the
>> database server while the 100th one hasn't come yet, and the need
>> to throw the previous data accumulation away has not come yet?
>
> They will have been written into the table.  They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits.  These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf
>
>> There have to be some limits to the space and/or counts taken by
>> the new, uncommitted, data, while the COPY operation is still in
>> progress.  What are they?
>
> Primarily disk space for the table.  If you are not taking
> advantage of the "unlogged load" optimization, you will have
> written Write Ahead Log (WAL) records, too -- which (depending on
> your configuration) you may be archiving.  In that case, you may
> need to be concerned about the archive space required.  If you have
> foreign keys defined for the table, you may get into trouble on the
> RAM used to track pending checks for those constraints.  I would
> recommend adding any FKs after you are done with the big bulk load.
>
> PostgreSQL does *not* have a "rollback log" which will impose a
> limit.
>
>> Say, I am COPYing 100 TB of data and the bad records are close
>> to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.  Autovacuum will clean them out in the background, but
> if you want to restart your load against an empty table it might be
> a good idea to TRUNCATE that table; it will be a lot faster.
>
> --
> Kevin Grittner
> EDB:  http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
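
P.S.  To make the "what should I be checking" part concrete as well:
is the kind of pre-flight check below essentially all there is, or is
there more to it?  (Again only a sketch -- the data directory path,
database and table names are made up.)

    # Free space on the file system holding the cluster's data
    # directory (and, if separate, the one holding the WAL archive):
    df -h /var/lib/pgsql/data

    # What the target table already occupies on disk, including
    # indexes and TOAST:
    psql -d mydb -c \
        "SELECT pg_size_pretty(pg_total_relation_size('big_table'))"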