My (our) complaints about EC2 aren't particularly extensive, but the last time
I posted to the mailing list and mentioned that they were using EC2, the first
reply was someone saying the corruption was EC2's fault.

It's not that we have no complaints at all (some aspects are very
frustrating); I was just trying to stave off anyone who was going to reply
with "Tell them to stop using EC2".

 -- More detail about the script that kills queries:

Honestly, we (or, at least, I) haven't discovered which type of kill they
were doing, but it does seem to be the culprit in some way.  I don't talk to
the customers (that's my boss's job), so I didn't get to ask specifics.  If
my boss did ask specifics, he didn't tell me.

The previous issue involved toast corruption showing up very regularly (e.g.
once a day, in some cases), with the end result being that we had to delete
the corrupted rows.  Coming back the next day to find the same corruption on
different rows was not very encouraging.
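
For anyone curious how you find rows like that: the usual trick (not
necessarily exactly what we ran) is to read the table a row at a time and see
which ones error out when the toasted column is actually fetched.  A rough
sketch in Python/psycopg2, with made-up table and column names:

    # Rough sketch, not our actual script: read a table one row at a time and
    # record which primary keys error out when the TOAST-ed column is fetched.
    # "example_table", "id", and "big_text_col" are placeholder names.
    import psycopg2

    conn = psycopg2.connect("dbname=example")
    conn.autocommit = True  # so one failed SELECT doesn't abort everything after it

    with conn.cursor() as cur:
        cur.execute("SELECT id FROM example_table ORDER BY id")
        ids = [row[0] for row in cur.fetchall()]

    bad_ids = []
    for row_id in ids:
        try:
            with conn.cursor() as cur:
                # length() forces the toasted value to actually be read back
                cur.execute(
                    "SELECT length(big_text_col) FROM example_table WHERE id = %s",
                    (row_id,),
                )
                cur.fetchone()
        except psycopg2.Error as exc:
            # Corrupt rows typically show up here as "missing chunk number N
            # for toast value ..." or similar errors.
            bad_ids.append(row_id)
            print(row_id, exc)

    print("rows with unreadable toast data:", bad_ids)

Once you have the ids, deleting (or overwriting) those rows is about all you
can do if there's no clean backup to pull them from.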

We found out afterward that they had a script running as a daemon that would,
every ten minutes (I believe), check the number of locks on the table and
kill all waiting queries if there were >= 1000 locks.  My best guess at the
general shape of that script is sketched below.
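
We never saw the script itself, so this is only a guess at its general shape;
every name and number below is an assumption, and it uses pg_cancel_backend()
where the real thing apparently did something much harsher:

    # Guess at the daemon's general shape; we never saw the real script.
    # Table name, threshold, and interval are assumptions, and this version
    # cancels waiting backends politely rather than kill -9'ing them.
    import time
    import psycopg2

    LOCK_THRESHOLD = 1000   # ">= 1000 locks", as we were told
    CHECK_INTERVAL = 600    # every ten minutes, I believe

    def check_and_cancel():
        conn = psycopg2.connect("dbname=example")
        conn.autocommit = True
        try:
            with conn.cursor() as cur:
                # How many locks are held or requested on the table right now?
                cur.execute(
                    "SELECT count(*) FROM pg_locks "
                    "WHERE relation = 'example_table'::regclass"
                )
                (lock_count,) = cur.fetchone()
                if lock_count >= LOCK_THRESHOLD:
                    # Cancel every backend whose lock request hasn't been granted yet
                    cur.execute(
                        "SELECT pg_cancel_backend(pid) FROM pg_locks WHERE NOT granted"
                    )
                    cur.fetchall()
        finally:
            conn.close()

    while True:
        check_and_cancel()
        time.sleep(CHECK_INTERVAL)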

Even if the corruption wasn't a result of that, we weren't too excited about
the process being there in the first place; there had to be a better solution
than just killing the queries.  So we had a discussion about the intent of
that script, my boss put together something that solved the same problem
without killing queries, and then we had them stop the daemon.  We've been
working with that database since to make sure it doesn't go screwy again, and
no new corruption has shown up since the daemon was stopped.

That memory allocation issue looked drastically different from the toast
value errors, though, so it seemed like a separate problem.  But now it's
looking like more corruption.

---

We're requesting that they do a few things (this is their production
database, so we usually don't alter any data unless they ask us to),
including deleting those rows.  My memory is insufficient, so there's a good
chance that I'll forget to post back to the mailing list with the results,
but I'll try to remember to do so.

Thank you for the help - I'm sure I'll be back soon with many more
questions.

-Sam

On Wed, Sep 8, 2010 at 2:58 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Merlin Moncure <mmonc...@gmail.com> writes:
> > On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <s...@consistentstate.com> wrote:
> >> So ... yes, it seems that those four id's are somehow part of the problem.
> >> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
> >> either), so memtest isn't available, but no new corruption has cropped up
> >> since they stopped killing the waiting queries (I just double checked -
> >> they were getting corrupted rows constantly, and we haven't gotten one
> >> since that script stopped killing queries).
>
> > That's actually a startling indictment of ec2 -- how were you killing
> > your queries exactly?  You say this is repeatable?  What's your
> > setting of full_page_writes?
>
> I think we'd established that they were doing kill -9 on backend
> processes :-(.  However, PG has a lot of track record that says that
> backend crashes don't result in corrupt data.  What seems more likely
> to me is that the corruption is the result of some shortcut taken while
> shutting down or migrating the ec2 instance, so that some writes that
> Postgres thought got to disk didn't really.
>
>                        regards, tom lane
>
