Re: [HACKERS] emergency outage requiring database restart

Merlin Moncure Fri, 28 Oct 2016 14:18:37 -0700

On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>>
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <st...@mit.edu> wrote:
>>>
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmonc...@gmail.com>
>>> wrote:
>>>>
>>>> I think we can rule out faulty storage
>>>
>>>
>>> Nobody ever expects the faulty storage
>
>
> LOL
>
>> Believe me, I know.  But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.


upthread, you might have noticed that I already did that.   Here is
the other evidence:
*) server running fine for 5+ years
*) other database on same cluster not impacted with 10x write activity
*) no interesting logs reported in  /var/log/messages, dmesg, etc
*) san fabric turns over petabytes/day with no corruption. 100+
postgres clusters, 1000+ sql server clusters (and that's not
production)
*) storage/network teams have been through everything. nothing
intersting/unusual to report
*) we have infrequently run routing (posted upthread) that, when run,
database crashed within minutes
*) after turning on checksums, 30% of invocations of routine resulted
in checksum error
*) problem re-occurred after dump-restore and full cluster rebuild
*) checksum error caused routine rollback.  FWICT this prevented the damage
*) everything is fine now that routine is not being run anymore

you can come up with your conclusion, I've come up with mine.  The
only frustrating thing here is that I can't reproduce out of the
production environment.  If this database goes down I have 30 people
sitting around so I can't take downtime lightly.

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of etl to/from sql server and it's all sqsh based.  If I
can figure out how to reproduce in a better way, I'll zero in on the
problem in about 10 minutes.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] emergency outage requiring database restart

Reply via email to