Three times in the last two weeks, we have experience data corruption secondary 
servers using streaming replication on client systems.  The versions involved 
are 9.0.14, 9.2.5, and 9.3.1.  Each incident was separate; two cases they were 
for the same client (9.0.14 and 9.3.1), one for a different client (9.2.5).

The details of each incident are similar, but not identical.

The details of each incident are:

INCDIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of 
an existing, correct primary (P1) for the base backup, and using WAL-E for WAL 
segment shipping.  Both the primary and secondary were running 9.0.14.  S1 
properly connected to the primary once the it was caught up on WAL segments, 
and S1 was then promoted as a primary using the trigger file.

No errors in the log files on either system.

After promotion, it was discovered that there was significant data loss on S1.  
Rows that were present on P1 were missing on S1, and some rows were duplicated 
(including duplicates that violated primary key and other unique constraints).  
The indexes were corrupt, in that they seemed to think that the duplicates were 
not duplicated, and that the missing rows were still present.

Because the client's schema included a "last_updated" field, we were able to 
determine that all of the rows that were either missing or duplicated had been 
updated on P1 shortly (3-5 minutes) before S1 was promoted.  It's possible, but 
not confirmed, that there were active queries (including updates) running on P1 
at the moment of S1's promotion.

As a note, P1 was created from another system (let's call it P0) using just WAL 
shipping (no streaming replication), and no data corruption was observed.

P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as 
the file system) as the data volume.

P1 and S1 have been destroyed at this point.


INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of 
S1y, after having dropped the primary and unique constraints, and restored into 
a new 9.3.1 server, P2.  Duplicate rows were purged, and missing rows were 
added again.  The database, a new primary, was then put back into production, 
and ran without incident.

A new secondary, S2 was created off of the primary.  This secondary was created 
using pg_basebackup using --xlog-method=stream, although the WAL-E archiving 
was still present.

S2 attached to P2 without incident and no errors in the logs, but 
nearly-identical corruption was discovered (although this time without the 
duplicated rows, just missing rows).  At this point, it's not clear if there 
was some clustering in the "last_updated" timestamp for the rows that are 
missing from S2.  No duplicated rows were observed.

P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as 
the file system) as the data volume.

No errors in the log files on either system.

P2 and S2 are still operational.


INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 
system (P3) to a new 9.2.5 system (S3) using streaming replication.  As I 
personally didn't do the steps on this one, I don't have quite as much 
information, but the basics are close to incident #2: When S3 was promoted 
using the trigger file, no errors were observed and the database came up 
normally, but rows were missing from S3 that were present on P3.

P1 is running Centos 6.3 with ext4 as the file system.

P2 is running Centos 6.4 with ext3 as the file system.

Log shipping in this case was done via rsync.

P3 and S3 are still operational.

No errors in the log files on either system.

--

Obviously, we're very concerned that a bug was introduced in the latest minor 
release.  We're happy to gather data as required to assist in diagnosing this.
--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to