[HACKERS] Collations and Replication; Next Steps

Matthew Kelly Tue, 16 Sep 2014 11:42:02 -0700

Hello,

Last month, I brought up the following issue to the general mailing list about 
how running streaming replication between machines running different versions 
of glibc can cause corrupt indexes.
http://www.postgresql.org/message-id/ba6132ed-1f6b-4a0b-ac22-81278f5ab...@tripadvisor.com

In the month following, we have done further investigation here at TripAdvisor
and have found that scope of this issue is far more troubling than initially
thought. Hackers seems like appropriate place to present this update because
it will certainly motivate some discussion about the approach to collation
support going forward.

After the initial episode, we thought it was necessary to find the true scope
of the problem. We developed a quick smoke test to evaluate the integrity of
the indexes on a given machine. We understood that the test was not
exhaustive, but it would catch most instances of corrupt indexes given
TripAdvisor's normal database usage pattern. The source code with
documentation about how it works is available at
(https://github.com/mkellycs/postgres_index_integrity_check) for those
interested.

What we found with this simple check was simply frightening. In every single
streaming replica cluster where one or more machines had been commissioned at a
different time, that member was found to be corrupt. When hardware upgrades of
the master had been accomplished with a streaming replication, the new master
was also found to have similar issues. The following numbers are only as small
as they are because our adoption of streaming replication has barely just
begun. So far we have found:

* 8 internal production databases, and 2 live site database servers
effected.
* Up to 3771 rows out of place in a single index (more correctly: 3771
times a row was smaller then the row before it when sorted in ascending order,
the actual number of incorrectly placed rows is probably much higher)
* On the worst offender, there were 12806 rows out of place across 26
indexes
* On average roughly 15% of indexes containing text keys on tables larger
100MB were found to exhibit this issue
* In at least one case, rebuilding a unique index on a master revealed that
the database had allowed 100+ primary key violations.

It sounds like we as a community knew that these issues were theoretically
possible, but I now have empirical evidence demonstrating the prevalence of
this issue on our corpus of international data. Instances of this issue showed
up in indexes of member usernames, location/property names, and even Facebook
url's. I encourage other sufficiently large operations who index
internationalized text to run similar tests; its highly likely they have
similar latent issues that they just have not detected yet.

Here is the simple reality. Collation based indexes, streaming replication,
and multiple versions of glibc/os cannot coexist in a sufficiently large
operation and not cause corrupt indexes. The current options are to collate
all of your indexes in C, or to ensure that all of your machines run exactly
the same OS version.

The first and immediate TODO is to patch the documentation to add warnings
regarding this issue. I can propose a doc patch explaining the issue, if no
one has any objections.

The second and far more challenging problem is how do we fix this issue? As of
our last discussion, Peter Geoghegan revived the proposal of using ICU as an
alternative.
(http://www.postgresql.org/message-id/CAEYLb_WvdCzuL=cyf1xyzjwn-1cvo6kzeawmkbxts3jphtj...@mail.gmail.com)
I do not feel qualified to compare the value of this library to other
options, but I am certainly willing to help with the patch process once a
direction has been selected.

I will be at Postgres Open in Chicago this week, and I will be more than
willing to further discuss the details of what we have found.

Regards,
Matt Kelly

[HACKERS] Collations and Replication; Next Steps

Reply via email to