Re: [HACKERS] Parallel pg_dump's error reporting doesn't work worth squat

Kyotaro HORIGUCHI Tue, 31 May 2016 23:10:52 -0700

At Tue, 31 May 2016 12:29:50 -0400, Tom Lane <[email protected]> wrote in 
<[email protected]>
> Kyotaro HORIGUCHI <[email protected]> writes:
> > At Fri, 27 May 2016 13:20:20 -0400, Tom Lane <[email protected]> wrote in 
> > <[email protected]>
> >> Kyotaro HORIGUCHI <[email protected]> writes:
> >>> By the way, the reason of the "invalid snapshot identifier" is
> >>> that some worker threads try to use it after the connection on
> >>> the first worker closed.
> 
> >> ... BTW, I don't quite see what the issue is there.
> 
> > The master session died from lack of libz and the failure of
> > compressLevel's propagation already fixed. Some of the children
> > that started transactions after the master's death will get the
> > error.
> 
> I don't think I believe that theory, because it would require the master
> to not notice the lack of libz before it launches worker processes, but
> instead while the workers are working.


Before cae2bb1, the master actually *didn't* notice the lack of libz
until it launched worker processes. So the current master doesn't
suffer from the problem, but it is still undesirable for the sudden
death of a child, for whatever reason, to cause this kind of behavior.

>  But AFAICS, while there are worker
> processes open, the master does nothing except wait for workers and
> dispatch new jobs to them; it does no database work of its own.  So the
> libz-isn't-there error has to have occurred in one of the workers.

Yes, the first worker to be given a job dies from that, and then the
master closes its connection, which owns the snapshot, before
terminating the other workers. This occurs with the current master
(9ee56df) minus cae2bb1, when configured with --without-zlib.

> > If we want prevent it perfectly, one solution could be that
> > non-master children explicitly wait the master to arrive at the
> > "safe" state before starting their transactions. But I suppose it
> > is not needed here.
> 
> Actually, I believe the problem is in archive_close_connection, around
> line 295 in HEAD: once the master realizes that one child has failed,
> it first closes its own database connection and only second tries to kill
> the remaining children.  So there's a race condition wherein remaining
> children have time to see the missing-snapshot error.

Agreed.

> In the patch I posted yesterday, I reversed the order of those two
> steps, which should fix this problem in most scenarios:
> https://www.postgresql.org/message-id/[email protected]

Yeah, just transposing DisconnectDatabase and ShutdownWorkersHard
in archive_close_connection fixed the problem.
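
To illustrate why the order matters, here is a toy model of the two
orderings. It is only a sketch, not the pg_dump code: DumpState and all
the function names below are hypothetical stand-ins, and the race is
modeled as a single step in which every still-living worker tries to
start a transaction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dump state; not a real pg_dump struct. */
typedef struct
{
    bool master_connected;  /* the master's connection owns the snapshot */
    int  live_workers;      /* workers that may still start a transaction */
    int  snapshot_errors;   /* workers that found the snapshot missing */
} DumpState;

/* A live worker fails if the snapshot's owner has already disconnected. */
static void
workers_start_transactions(DumpState *st)
{
    if (!st->master_connected)
        st->snapshot_errors += st->live_workers;
}

/* Stand-in for killing the remaining workers. */
static void
shutdown_workers_hard(DumpState *st)
{
    st->live_workers = 0;
}

/* Stand-in for closing the master's database connection. */
static void
disconnect_database(DumpState *st)
{
    st->master_connected = false;
}

/* Old order: disconnect first, so live workers can hit the race window. */
static int
close_connection_old_order(DumpState st)
{
    assert(st.master_connected);
    disconnect_database(&st);
    workers_start_transactions(&st);    /* race window is open here */
    shutdown_workers_hard(&st);
    return st.snapshot_errors;
}

/* Transposed order: no worker is left alive when the snapshot goes away. */
static int
close_connection_new_order(DumpState st)
{
    assert(st.master_connected);
    shutdown_workers_hard(&st);
    workers_start_transactions(&st);    /* nobody left to see the error */
    disconnect_database(&st);
    return st.snapshot_errors;
}
```

In this model, with three workers still alive, the old order reports
three snapshot errors and the transposed order reports none.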

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




