Russell Smith wrote:
Hi,

As I'm interested in this topic, I thought I'd take a look at the
patch.  I don't have the capability to test it on high-end hardware,
but I did some basic testing on my workstation and a basic review of
the patch.

I somehow had the impression that instead of creating a new connection
for each restore item, we would create the processes at the start and
then send them the dumpIds they should be restoring.  That would allow
the controller to batch dumpIds together and expect each worker to
process its batch in a transaction.  But this is probably just an idea
I created in my head.
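Just to illustrate what I had in mind, and nothing more than that (the
dumpId framing, the pipe, and restore_one_item() are all made up here,
not taken from the patch):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int dumpId;

/* placeholder for whatever would actually restore one archive item */
static void
restore_one_item(dumpId id)
{
    printf("worker %d: restoring dump item %d\n", (int) getpid(), id);
}

/* a pre-forked worker blocks on its pipe and restores each dumpId sent */
static void
worker_loop(int cmd_fd)
{
    dumpId  id;

    while (read(cmd_fd, &id, sizeof(id)) == sizeof(id))
        restore_one_item(id);
    exit(0);
}

int
main(void)
{
    int     fds[2];
    dumpId  ids[] = {4, 7, 9};      /* pretend batch from the TOC */
    int     i;

    pipe(fds);
    if (fork() == 0)
    {
        close(fds[1]);
        worker_loop(fds[0]);        /* child: becomes a worker */
    }
    close(fds[0]);
    for (i = 0; i < 3; i++)         /* controller hands out work */
        write(fds[1], &ids[i], sizeof(dumpId));
    close(fds[1]);                  /* EOF tells the worker to finish */
    wait(NULL);
    return 0;
}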

Yes it is. To do that I would have to invent a protocol for talking to the workers, etc., and there is not the slightest chance I would get that done by November. And I don't see the virtue in processing them all in a single transaction; I've provided a much simpler means of avoiding WAL logging of the COPY.
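To be concrete, and only as a sketch of the idea rather than the patch
code: when WAL archiving is off, the server can skip WAL-logging a COPY
into a table that was truncated earlier in the same transaction, so a
worker just wraps the truncate and the data load in one transaction.
Something along these lines, with libpq; the helper name, statement
arguments and (elided) error handling are invented here:

#include <libpq-fe.h>

/*
 * Sketch only: load one table's data with the COPY's WAL skipped.
 * Assumes WAL archiving is disabled and that copy_stmt is the
 * archive's "COPY ... FROM stdin" statement.
 */
static void
restore_table_data(PGconn *conn, const char *truncate_stmt,
                   const char *copy_stmt,
                   const char *data, int datalen)
{
    PGresult *res;

    PQclear(PQexec(conn, "BEGIN"));

    /* TRUNCATE in the same transaction as the COPY below */
    PQclear(PQexec(conn, truncate_stmt));

    res = PQexec(conn, copy_stmt);
    if (PQresultStatus(res) == PGRES_COPY_IN)
    {
        PQputCopyData(conn, data, datalen);
        PQputCopyEnd(conn, NULL);
        PQclear(PQgetResult(conn));     /* collect the COPY's result */
    }
    PQclear(res);

    PQclear(PQexec(conn, "COMMIT"));
}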

Do we know why we experience "tuple concurrently updated" errors if we
spawn threads too fast?

No. That's an open item.
I completed some test restores using the pg_restore from head with the
patch applied.  The dump was a custom dump created with pg 8.2 and
restored to an 8.2 database.  To confirm it would work, I first
completed a restore using the standard single-threaded mode.  The
schema restored successfully; the only errors reported involved
non-existent roles.

When I attempt to restore using parallel restore, I get out-of-memory
errors reported from _PrintData.  The code returning the error is:

_PrintData(...
    while (blkLen != 0)
    {
        if (blkLen + 1 > ctx->inSize)
        {
            free(ctx->zlibIn);
            ctx->zlibIn = NULL;
            ctx->zlibIn = (char *) malloc(blkLen + 1);
            if (!ctx->zlibIn)
                die_horribly(AH, modulename, " out of memory\n");

            ctx->inSize = blkLen + 1;
            in = ctx->zlibIn;
        }


It appears from my debugging and from looking at the code that in
_PrintData:

    lclContext *ctx = (lclContext *) AH->formatData;

the memory context is shared across all threads, which means it's
possible the memory contexts are stomping on each other.  My GDB skills
are not up to reproducing this in a gdb session, as there are forks
going on all over the place, and when the items are processed in a
serial fashion there aren't any errors.  I'm not sure of the fix for
this, but in a parallel environment it doesn't seem possible to store
the memory context in the AH.


There are no threads, hence nothing is shared. fork() creates a new process, not a new thread, and all the processes share are file descriptors.
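If it helps, here is a trivial standalone illustration of the point
(nothing to do with pg_restore itself, just fork() semantics): after
the fork, each process has its own copy of the heap, so one worker
reallocating its zlib buffer can't stomp on another's.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    char  *buf = strdup("parent");
    pid_t  pid = fork();

    if (pid == 0)
    {
        /* child: modifies its own copy-on-write copy of the heap */
        strcpy(buf, "child");
        printf("child sees:  %s\n", buf);
        exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf);   /* still "parent" */
    free(buf);
    return 0;
}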


I also received messages saying "pg_restore: [custom archiver] could not
read from input file: end of file".  I have not investigated these
further, as my current guess is that they are linked to the
out-of-memory error.

Given that I ran into this error on my first testing attempt, I haven't
evaluated much else at this point.  All of this could be because I'm
using an 8.2 archive, but it works fine in single-restore mode.  The
dump file is about 400MB compressed, and an entire schema was excluded
from the restore with a custom restore list.

Command line used:  PGPORT=5432 ./pg_restore -h /var/run/postgresql -m4
--truncate-before-load -v -d tt2 -L tt.list
/home/mr-russ/pg-index-test/timetable.pgdump 2> log.txt

I've attached the log.txt file so you can review the errors that I saw.
I adjusted the "out of memory" errors to include a number, to work out
which one was being triggered; the "5 out of memory" you'll see in the
log file corresponds to the code above.
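(The adjustment was roughly along these lines; the number just marks
the call site quoted earlier:)

    if (!ctx->zlibIn)
        die_horribly(AH, modulename, "5 out of memory\n");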

However, there does seem to be something odd happening with the compression library, which I will investigate. Thanks for the report.

cheers

andrew


