On 07/07/2022 22:51, Andrey Lepikhov wrote:
> On 7/7/2022 06:14, Ian Barwick wrote:
>> On Thu, 24 Mar 2022 at 15:44, Andrey V. Lepikhov <a.lepik...@postgrespro.ru> wrote:
>>  >
>>  > On 3/22/22 06:54, Etsuro Fujita wrote:
>>  > > On Fri, Jun 4, 2021 at 5:26 PM Andrey Lepikhov
>>  > > <a.lepik...@postgrespro.ru> wrote:
>>  > >> We still have slow 'COPY FROM' operation for foreign tables in current
>>  > >> master.
>>  > >> Now we have a foreign batch insert operation, and I tried to rewrite the
>>  > >> patch [1] with this machinery.
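
(For readers following along: the batch insert machinery referred to here is
the pair of FDW callbacks added in PostgreSQL 14. Their signatures, roughly as
declared in src/include/foreign/fdwapi.h, are:)

    typedef TupleTableSlot **(*ExecForeignBatchInsert_function) (EState *estate,
                                                                 ResultRelInfo *rinfo,
                                                                 TupleTableSlot **slots,
                                                                 TupleTableSlot **planSlots,
                                                                 int *numSlots);

    typedef int (*GetForeignModifyBatchSize_function) (ResultRelInfo *rinfo);
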
>>  > >
>>  > > The patch has been rewritten to something essentially different, but
>>  > > no one reviewed it.  (Tsunakawa-san gave some comments without looking
>>  > > at it, though.)  So the right status of the patch is “Needs review”,
>>  > > rather than “Ready for Committer”?  Anyway, here are a few review
>>  > > comments from me:
>>  > >
>>  > > * I don’t think this assumption is correct:
>>  > >
>>  > > @@ -359,6 +386,12 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
>>  > > (resultRelInfo->ri_TrigDesc->trig_insert_after_row ||
>>  > >                    resultRelInfo->ri_TrigDesc->trig_insert_new_table))
>>  > >          {
>>  > > +           /*
>>  > > +            * AFTER ROW triggers aren't allowed with the foreign bulk insert
>>  > > +            * method.
>>  > > +            */
>>  > > +           Assert(resultRelInfo->ri_RelationDesc->rd_rel->relkind !=
>>  > > +                  RELKIND_FOREIGN_TABLE);
>>  > > +
>>  > >
>>  > > In postgres_fdw we disable foreign batch insert when the target table
>>  > > has AFTER ROW triggers, but the core allows it even in that case.  No?
>>  > Agree
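
(For context, a paraphrased sketch of the postgres_fdw side: its
GetForeignModifyBatchSize callback already falls back to a batch size of 1
when batching is unsafe, but that is FDW-local policy, so a core-side Assert
can't rely on it. The conditions below are from memory of postgres_fdw.c,
not a verbatim excerpt:)

    /*
     * Paraphrased sketch of postgres_fdw's batch-size decision: force
     * single-row inserts when RETURNING, WITH CHECK OPTIONs or row-level
     * INSERT triggers are present.  Another FDW is free to batch anyway.
     */
    static int
    postgresGetForeignModifyBatchSize(ResultRelInfo *resultRelInfo)
    {
        int     batch_size = 1;     /* really taken from server/table options */

        if (resultRelInfo->ri_projectReturning != NULL ||
            resultRelInfo->ri_WithCheckOptions != NIL ||
            (resultRelInfo->ri_TrigDesc &&
             (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
              resultRelInfo->ri_TrigDesc->trig_insert_after_row)))
            return 1;

        return batch_size;
    }
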
>>  >
>>  > > * To allow foreign multi insert, the patch made an invasive change to
>>  > > the existing logic to determine whether to use multi insert for the
>>  > > target relation, adding a new member ri_usesMultiInsert to the
>>  > > ResultRelInfo struct, as well as introducing a new function
>>  > > ExecMultiInsertAllowed().  But I’m not sure we really need such a
>>  > > change.  Isn’t it reasonable to *adjust* the existing logic to allow
>>  > > foreign multi insert when possible?
>>  > Of course, such an approach would look much better if we implemented it.
>>  > I'll ponder how to do it.
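
(To make that suggestion concrete, one hypothetical shape for it: CopyFrom()'s
existing insertMethod decision could consult a helper like the one below and
only force CIM_SINGLE for FDWs that can't batch, instead of for every foreign
table. This is a sketch of the idea, not the patch; the helper name is
invented:)

    /*
     * Hypothetical helper: does this FDW-backed relation support batched
     * (multi) inserts?  Sketch only; not from the patch under review.
     */
    static bool
    FdwCanBatchInsert(ResultRelInfo *resultRelInfo)
    {
        FdwRoutine *fdw = resultRelInfo->ri_FdwRoutine;

        return fdw != NULL &&
               fdw->GetForeignModifyBatchSize != NULL &&
               fdw->ExecForeignBatchInsert != NULL;
    }
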
>>  >
>>  > > I didn’t finish my review, but I’ll mark this as “Waiting on Author”.
>>  > I rebased the patch onto current master. Now it works correctly. I'll
>>  > mark it as "Waiting for review".
>>
>> I took a look at this patch as it would be a useful optimization to have.
>>
>> It applies cleanly to current HEAD, but as-is, with a large data set, it
>> reproducibly fails like this (using postgres_fdw):
>>
>>      postgres=# COPY foo FROM '/tmp/fast-copy-from/test.csv' WITH (format csv);
>>      ERROR:  bind message supplies 0 parameters, but prepared statement "pgsql_fdw_prep_19422" requires 6
>>      CONTEXT:  remote SQL command: INSERT INTO public.foo_part_1(t, v1, v2, v3, v4, v5) VALUES ($1, $2, $3, $4, $5, $6)
>>      COPY foo, line 17281589
>>
>> This occurs because not all multi-insert buffers being flushed actually
>> contain tuples; the fix is simply not to call ExecForeignBatchInsert() if
>> that's the case, e.g.:
>>
>>
>>          /* Flush into foreign table or partition */
>>          do {
>>              int size = (resultRelInfo->ri_BatchSize < nused - sent) ?
>>                          resultRelInfo->ri_BatchSize : (nused - sent);
>>
>>              if (size)
>>              {
>>                  int inserted = size;
>>
>>                  resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert(estate,
>>                                                                       resultRelInfo,
>>                                                                       &slots[sent],
>>                                                                       NULL,
>>                                                                       &inserted);
>>                  sent += size;
>>              }
>>          } while (sent < nused);
>>
>>
>> There might be a case for arguing that the respective FDW should check that
>> it has actually received some tuples to insert, but IMHO it's much
>> preferable to catch this as early as possible and avoid a superfluous call.
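
(If someone did want that belt-and-braces variant, it would amount to an early
return in the FDW callback. A minimal sketch against postgres_fdw's
ExecForeignBatchInsert implementation, with the existing body elided:)

    static TupleTableSlot **
    postgresExecForeignBatchInsert(EState *estate,
                                   ResultRelInfo *resultRelInfo,
                                   TupleTableSlot **slots,
                                   TupleTableSlot **planSlots,
                                   int *numSlots)
    {
        /* Defensive guard: nothing buffered, so skip the remote INSERT. */
        if (*numSlots == 0)
            return slots;

        /* ... existing prepare/bind/execute logic would follow unchanged ... */
        return slots;
    }
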
>>
>> FWIW, with the above fix in place, a simple local test shows the patch
>> producing a consistent speed-up of about 8 times compared to the existing
>> functionality.
>
> Thank you for your attention to the patch.
> I have a couple of questions:
>
> 1. I'm having trouble reproducing the case you reported. Can you give more
>    details on the reproduction?

The issue seems to occur when the data spans more than one foreign partition,
probably because the accumulated data for one partition needs to be flushed
before moving on to the next partition, but not all pre-allocated multi-insert
buffers have been filled.

The reproduction method I have, which is pared down from the original bulk
insert that triggered the error, is as follows:

1. Create some data using the attached script (a rough, hypothetical stand-in
   for it is sketched after these steps):

    perl data.pl > /tmp/data.csv

2. Create two nodes (A and B)

3. On node B, create tables as follows:

    CREATE TABLE foo_part_1 (t timestamptz, v1 int, v2 int, v3 int, v4 text, v5 text);
    CREATE TABLE foo_part_2 (t timestamptz, v1 int, v2 int, v3 int, v4 text, v5 text);

4. On node A, create FDW and partitioned table as follows:

    -- adjust parameters as appropriate

    CREATE EXTENSION postgres_fdw;

    CREATE SERVER pg_fdw
      FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (
        host 'localhost',
        port '6301',
        dbname 'postgres',
        batch_size '100'
     );

    CREATE USER MAPPING FOR CURRENT_USER SERVER pg_fdw
      OPTIONS(user 'postgres');

    -- create partitioned table and partitions

    CREATE TABLE foo (t timestamptz, v1 int, v2 int, v3 int, v4 text, v5 text)
      PARTITION BY RANGE(t);

    CREATE FOREIGN TABLE foo_part_1
       PARTITION OF foo
       FOR VALUES FROM ('2022-05-19 00:00:00') TO ('2022-05-20 00:00:00')
       SERVER pg_fdw;

    CREATE FOREIGN TABLE foo_part_2
       PARTITION OF foo
       FOR VALUES FROM ('2022-05-20 00:00:00') TO ('2022-05-21 00:00:00')
       SERVER pg_fdw;

5. On node A, load the previously generated data with COPY:

    COPY foo FROM '/tmp/data.csv' with (format 'csv');

This will fail like this:

    ERROR:  bind message supplies 0 parameters, but prepared statement "pgsql_fdw_prep_178" requires 6
    CONTEXT:  remote SQL command: INSERT INTO public.foo_part_1(t, v1, v2, v3, v4, v5) VALUES ($1, $2, $3, $4, $5, $6)
    COPY foo, line 88160
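
For completeness, here is the promised stand-in for the attached data.pl. The
script itself is not reproduced here, so the row count, value shapes and
alternation scheme below are my assumptions; all that matters is emitting
enough CSV rows, with timestamps alternating between the two partition ranges,
that COPY keeps switching partitions mid-stream:

    /*
     * gendata.c -- hypothetical stand-in for data.pl (assumed, not the
     * original attachment).  Emits CSV rows matching foo's columns, with
     * timestamps alternating between foo_part_1's and foo_part_2's ranges.
     *
     * Build and run:  cc -o gendata gendata.c && ./gendata > /tmp/data.csv
     */
    #include <stdio.h>

    int
    main(void)
    {
        for (int i = 0; i < 200000; i++)
        {
            /* even rows land in foo_part_1, odd rows in foo_part_2 */
            const char *day = (i % 2 == 0) ? "2022-05-19" : "2022-05-20";

            printf("%s 12:%02d:%02d+00,%d,%d,%d,text-a-%d,text-b-%d\n",
                   day, (i / 60) % 60, i % 60, i, i * 2, i * 3, i, i);
        }
        return 0;
    }
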


> 2. Have you tried the previous version, based on bulk COPY machinery rather
>    than bulk INSERT? Which approach looks better and has better performance
>    in your opinion?

Aha, I didn't see that, I'll take a look.


Regards

Ian Barwick

--

EnterpriseDB - https://www.enterprisedb.com

Attachment: data.pl
Description: Perl program
