Re: [GENERAL] Critical failure of standby

2016-08-20 Thread Jeff Janes
On Mon, Aug 15, 2016 at 7:23 PM, James Sewell wrote: > Those are all good questions. > > Essentially this is a situation where DR is network separated from Prod - > so I would expect the archive command to fail. > archive_command or restore_command? I thought it was

Re: [GENERAL] Critical failure of standby

2016-08-17 Thread James Sewell
Hi, No, this was a one-off in a network split situation. I'll check the startup when I get a chance - thanks for the help. Cheers, James Sewell, Solutions Architect Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009 *P *(+61) 2 8099 9000 *W*

Re: [GENERAL] Critical failure of standby

2016-08-16 Thread Simon Riggs
On 16 August 2016 at 08:11, James Sewell wrote: > As per the logs there was a crash of one standby, which seems to have > corrupted that standby and the two cascading standby. > >- No backups >- Full page writes enabled >- Fsync enabled > > WAL records are

Re: [GENERAL] Critical failure of standby

2016-08-16 Thread James Sewell
Hey Sameer, As per the logs there was a crash of one standby, which seems to have corrupted that standby and the two cascading standby. - No backups - Full page writes enabled - Fsync enabled Cheers, James Sewell, Solutions Architect Suite 112, Jones Bay Wharf, 26-32 Pirrama Road,

Re: [GENERAL] Critical failure of standby

2016-08-15 Thread Sameer Kumar
On Tue, Aug 16, 2016 at 1:10 PM James Sewell wrote: > Hey, > > I understand that. > > But a hot standby should always be ready to promote (given it originally > caught up) right? > > I think it's a moot point really as some sort of corruption has been > introduced, the

Re: [GENERAL] Critical failure of standby

2016-08-15 Thread James Sewell
Hey, I understand that. But a hot standby should always be ready to promote (given it originally caught up) right? I think it's a moot point really as some sort of corruption has been introduced, the machines still wouldn't start after they could see the archive destination again.

Re: [GENERAL] Critical failure of standby

2016-08-15 Thread John R Pierce
On 8/15/2016 7:23 PM, James Sewell wrote: Those are all good questions. Essentially this is a situation where DR is network separated from Prod - so I would expect the archive command to fail. I'll have to check the script; it must not be passing the error back through to PostgreSQL. This

Re: [GENERAL] Critical failure of standby

2016-08-15 Thread James Sewell
Those are all good questions. Essentially this is a situation where DR is network separated from Prod - so I would expect the archive command to fail. I'll have to check the script; it must not be passing the error back through to PostgreSQL. This still shouldn't cause database corruption though
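
A minimal sketch of what "passing the error back through to PostgreSQL" could look like, since PostgreSQL treats a WAL segment as archived only when archive_command exits with status zero. This is not the script from the thread: the names `archive_segment` and `main`, and the idea of a plain file copy into an archive directory, are assumptions made for illustration.

```python
import shutil
import sys
from pathlib import Path


def archive_segment(wal_path: str, archive_dir: str) -> int:
    """Copy one WAL segment into archive_dir; return 0 only on success.

    Hypothetical helper: PostgreSQL would retry the segment later
    whenever the command exits with a non-zero status.
    """
    src = Path(wal_path)
    dest = Path(archive_dir) / src.name
    # Refuse to overwrite a segment that is already archived.
    if dest.exists():
        print(f"refusing to overwrite {dest}", file=sys.stderr)
        return 1
    try:
        # Copy to a temporary name first so that a partial copy is
        # never visible under the final segment name.
        tmp = dest.with_name(dest.name + ".tmp")
        shutil.copyfile(src, tmp)
        tmp.rename(dest)
    except OSError as exc:
        # An unreachable destination (for example during a network
        # split) ends up here and is reported as a failure.
        print(f"archive failed for {src}: {exc}", file=sys.stderr)
        return 1
    return 0


def main(argv: list) -> int:
    """Entry point: expects the segment path and the archive directory."""
    if len(argv) != 3:
        print("usage: archive_wal.py <wal_path> <archive_dir>", file=sys.stderr)
        return 2
    return archive_segment(argv[1], argv[2])

# A real script would end with: sys.exit(main(sys.argv))
```

With a wrapper along these lines, archive_command would pass `%p` (the path of the segment to archive) as the first argument and a site-specific archive directory as the second. The point is only that every failure path returns a non-zero status instead of being swallowed.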

Re: [GENERAL] Critical failure of standby

2016-08-15 Thread Jeff Janes
On Thu, Aug 11, 2016 at 10:39 PM, James Sewell wrote: > Hello, > > We recently experienced a critical failure when failing to a DR > environment. > > This is in the following environment: > > >- 3 x PostgreSQL machines in Prod in a sync replication cluster >- 3

Re: [GENERAL] Critical failure of standby

2016-08-14 Thread James Sewell
Hello All, The thing which I find a little worrying is that this 'corruption' was introduced on the network from PROD -> DR, but then also cascaded to both other DR servers (either via replication or via archive_command). Is WAL corruption checked for in any way on standby servers? Here

Re: [GENERAL] Critical failure of standby

2016-08-12 Thread James Sewell
Hello, I double posted this (posted once from an unregistered email and assumed it would be junked). I'm continuing all discussion on the other thread now. Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009 *P

Re: [GENERAL] Critical failure of standby

2016-08-12 Thread James Sewell
(from other thread) - 9.5.3 - Redhat 7.2 on VMWare - Single PostgreSQL instance on each machine - Every machine in DR became corrupt, so interestingly this must have been sent to the two cascading nodes via WAL before the crash on the hub DR node - No OS logs indicating

Re: [GENERAL] Critical failure of standby

2016-08-12 Thread Alvaro Herrera
James Sewell wrote: > 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client= (0:0)LOG: > consistent recovery state reached at 3/8811DFF0 > 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client= (0:XX000)FATAL: > invalid memory alloc request size 3445219328 > 2016-08-12 04:43:53 GMT

Re: [GENERAL] Critical failure of standby

2016-08-12 Thread Melvin Davidson
On Fri, Aug 12, 2016 at 1:39 AM, James Sewell wrote: > Hello, > > We recently experienced a critical failure when failing to a DR > environment. > > This is in the following environment: > > >- 3 x PostgreSQL machines in Prod in a sync replication cluster >- 3

[GENERAL] Critical failure of standby

2016-08-12 Thread James Sewell
Hello, We recently experienced a critical failure when failing to a DR environment. This is in the following environment: - 3 x PostgreSQL machines in Prod in a sync replication cluster - 3 x PostgreSQL machines in DR, with a single machine async and the other two cascading from the