At Tue, 31 May 2016 12:29:50 -0400, Tom Lane <t...@sss.pgh.pa.us> wrote in <7445.1464712...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:
> > At Fri, 27 May 2016 13:20:20 -0400, Tom Lane <t...@sss.pgh.pa.us> wrote in
> > <14603.1464369...@sss.pgh.pa.us>
> >> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:
> >>> By the way, the reason of the "invalid snapshot identifier" is
> >>> that some worker threads try to use it after the connection on
> >>> the first worker closed.
> >
> >> ... BTW, I don't quite see what the issue is there.
> >
> > The master session died from lack of libz and the failure of
> > compressLevel's propagation already fixed. Some of the children
> > that started transactions after the master's death will get the
> > error.
>
> I don't think I believe that theory, because it would require the master
> to not notice the lack of libz before it launches worker processes, but
> instead while the workers are working.

Before cae2bb1, the master actually *didn't* notice the lack of
libz until it launched the worker processes. So the current master
doesn't suffer from the problem, but it is not desirable that the
sudden death of a child, for whatever reason, causes this kind of
behavior.

> But AFAICS, while there are worker
> processes open, the master does nothing except wait for workers and
> dispatch new jobs to them; it does no database work of its own. So the
> libz-isn't-there error has to have occurred in one of the workers.

Yes, the worker that receives the first command dies from that, and
then the master closes its own connection, which owns the snapshot,
before terminating the other workers. It occurs with the current
master (9ee56df) minus cae2bb1, configured with --without-zlib.

> > If we want prevent it perfectly, one solution could be that
> > non-master children explicitly wait the master to arrive at the
> > "safe" state before starting their transactions. But I suppose it
> > is not needed here.
>
> Actually, I believe the problem is in archive_close_connection, around
> line 295 in HEAD: once the master realizes that one child has failed,
> it first closes its own database connection and only second tries to kill
> the remaining children. So there's a race condition wherein remaining
> children have time to see the missing-snapshot error.

Agreed.

> In the patch I posted yesterday, I reversed the order of those two
> steps, which should fix this problem in most scenarios:
> https://www.postgresql.org/message-id/7005.1464657...@sss.pgh.pa.us

Yeah, just transposing DisconnectDatabase and ShutdownWorkersHard
in archive_close_connection fixed the problem.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
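
For illustration only, the ordering issue in archive_close_connection
discussed above can be modeled with a small stand-alone sketch. This is
not pg_dump code: DumpModel, make_model, disconnect_master,
shutdown_workers and the two close_* functions are hypothetical names,
and the model simply counts the workers that are still alive at the
moment the master's snapshot-owning connection goes away. Those are the
workers that could see "invalid snapshot identifier" if they started a
transaction in that window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 8

/* Hypothetical model of a parallel dump; not pg_dump's real structures. */
typedef struct
{
    bool   master_connected;          /* master connection owns the snapshot */
    bool   worker_alive[MAX_WORKERS];
    size_t nworkers;
} DumpModel;

/* Build a model in which exactly one worker ("failed") has already died. */
DumpModel make_model(size_t nworkers, size_t failed)
{
    assert(nworkers <= MAX_WORKERS && failed < nworkers);
    DumpModel m = { .master_connected = true, .nworkers = nworkers };
    for (size_t i = 0; i < nworkers; i++)
        m.worker_alive[i] = (i != failed);
    return m;
}

/* Live workers that would fail to import the snapshot right now. */
size_t workers_at_risk(const DumpModel *m)
{
    size_t n = 0;
    for (size_t i = 0; i < m->nworkers; i++)
        if (m->worker_alive[i] && !m->master_connected)
            n++;
    return n;
}

/* Stand-in for the master closing its own connection. */
void disconnect_master(DumpModel *m)
{
    m->master_connected = false;
}

/* Stand-in for the master killing all remaining workers. */
void shutdown_workers(DumpModel *m)
{
    for (size_t i = 0; i < m->nworkers; i++)
        m->worker_alive[i] = false;
}

/* Old order: disconnect first, then kill workers. Returns exposed workers. */
size_t close_disconnect_first(DumpModel *m)
{
    disconnect_master(m);
    size_t exposed = workers_at_risk(m);   /* window before workers are killed */
    shutdown_workers(m);
    return exposed;
}

/* Transposed order: kill workers first, then disconnect. */
size_t close_shutdown_first(DumpModel *m)
{
    shutdown_workers(m);
    disconnect_master(m);
    return workers_at_risk(m);             /* nobody is left to hit the error */
}
```

With four workers of which one has failed, the first ordering leaves
three workers exposed while the transposed ordering leaves none. The
real code is asynchronous, so this only shows why the order matters,
not how often the error would actually appear.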