Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-29 Thread David Powers
It's another possibility, but I think it's still somewhat remote given how long we've been using this method with this code. It's sadly hard to test because taking the full backup without the hard linking is fairly expensive (the databases comprise multiple terabytes). As a possibly unsatisfying

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-28 Thread Benedikt Grundmann
Today we have seen 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10 EDT,,0,ERROR,XX000,xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0,writing block 9 of relation base/16416/293974676 2013-05-28 04:11:13.316
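For readers skimming the thread: the two positions in the error above are WAL locations written as two hexadecimal halves, `high/low`. The sketch below is not from the thread; it assumes the usual reading of such a location as a 64-bit byte position (`high << 32 | low`) and uses a hypothetical helper name, `parse_lsn`, to show how far the requested flush point lies beyond what had actually been flushed.

```python
def parse_lsn(text: str) -> int:
    """Convert a 'HIGH/LOW' hex WAL location into one 64-bit byte position.

    Assumption: the part before the slash is the upper 32 bits and the
    part after it is the lower 32 bits.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied from the log line quoted in the message above.
requested = parse_lsn("1E95/AFB2DB10")
flushed = parse_lsn("1E7E/21CB79A0")

# How far the requested flush point is ahead of the flushed position.
gap_bytes = requested - flushed
print(f"requested is ahead of flushed by {gap_bytes / 2**30:.1f} GiB")
```

Under that assumption the page being written carries a location tens of gigabytes of WAL beyond anything this server has flushed, which suggests the page and the WAL on disk come from different points in time.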

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-28 Thread Robert Haas
On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann bgrundm...@janestreet.com wrote: Today we have seen 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10 EDT,,0,ERROR,XX000,xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0,writing

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-23 Thread Robert Haas
On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann bgrundm...@janestreet.com wrote: We are seeing these errors on a regular basis on the testing box now. We have even changed the backup script to shut down the hot standby, take an LVM snapshot, restart the hot standby, and rsync the LVM snapshot.

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-23 Thread David Powers
Thanks for the response. I have some evidence against an issue in the backup procedure (though I'm not ruling it out). We moved back to taking the backup off of the primary and all errors for all three clusters went away. All of the hardware is the same, OS and postgres versions are largely the

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-21 Thread Benedikt Grundmann
We are seeing these errors on a regular basis on the testing box now. We have even changed the backup script to shut down the hot standby, take an LVM snapshot, restart the hot standby, and rsync the LVM snapshot. It still happens. We never saw this before we introduced the hot standby. So we

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-16 Thread David Powers
I'll try to get the primary upgraded over the weekend when we can afford a restart. In the meantime I have a single test showing that a shutdown, snapshot, restart produces a backup that passes the vacuum analyze test. I'm going to run a full vacuum today. -David On Wed, May 15, 2013 at 3:53

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-15 Thread Heikki Linnakangas
On 14.05.2013 23:47, Benedikt Grundmann wrote: The only thing that is *new* is that we took the snapshot from the streaming replica. So again my best guess as of now is that if the database crashes while it is in streaming standby an invalid disk state can result during the following

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-15 Thread David Powers
First, thanks for the replies. This sort of thing is frustrating and hard to diagnose at a distance, and any help is appreciated. Here is some more background: We have 3 9.2.4 databases using the following setup: - A primary box - A standby box running as a hot streaming replica from the

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-15 Thread Heikki Linnakangas
On 15.05.2013 15:42, David Powers wrote: First, thanks for the replies. This sort of thing is frustrating and hard to diagnose at a distance, and any help is appreciated. Here is some more background: We have 3 9.2.4 databases using the following setup: The subject says 9.2.3. Are you sure

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-15 Thread Benedikt Grundmann
On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 15.05.2013 15:42, David Powers wrote: First, thanks for the replies. This sort of thing is frustrating and hard to diagnose at a distance, and any help is appreciated. Here is some more background: We

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-15 Thread Heikki Linnakangas
On 15.05.2013 22:50, Benedikt Grundmann wrote: On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The subject says 9.2.3. Are you sure you're running 9.2.4 on all the servers? There was a fix to a bug related to starting a standby server from a filesystem snapshot. I

[HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Benedikt Grundmann
Today we have seen this on our testing database instance: ERROR: could not open file base/16416/291498116.3 (target block 431006): No such file or directory That database gets created by rsyncing the LVM snapshot of the standby, which is a read-only backup of proddb using streaming replication.

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Heikki Linnakangas
On 14.05.2013 14:57, Benedikt Grundmann wrote: Today we have seen this on our testing database instance: ERROR: could not open file base/16416/291498116.3 (target block 431006): No such file or directory That database gets created by rsyncing the LVM snapshot of the standby, which is a

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Benedikt Grundmann
It's on the production database and the streaming replica. But not on the snapshot. production -rw--- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3 streaming replica -rw--- 1 postgres postgres 312778752 May 13 23:50

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Heikki Linnakangas
On 14.05.2013 16:48, Benedikt Grundmann wrote: It's on the production database and the streaming replica. But not on the snapshot. So, the LVM snapshot didn't work correctly? - Heikki

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Benedikt Grundmann
That's one possible explanation. It's worth noting that we haven't seen this before moving to streaming rep first and we have been using that method for a long time. On Tue, May 14, 2013 at 11:34 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 14.05.2013 16:48, Benedikt Grundmann

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Benedikt Grundmann
I think my previous message wasn't clear enough. I do *NOT* think that LVM snapshot is the culprit. However I cannot discount it as one of the possibilities. But I have no evidence in either /var/log/messages or in dmesg that the LVM snapshot went into a bad state AND we have been using this

Re: [HACKERS] streaming replication, frozen snapshot backup on it and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-14 Thread Amit Kapila
On Tuesday, May 14, 2013 7:19 PM Benedikt Grundmann wrote: It's on the production database and the streaming replica.  But not on the snapshot. production -rw--- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3 streaming replica -rw--- 1 postgres