Guy Helmer wrote:
Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.
So the last two days I have been round upgrading all our servers,
knowing that I had run the system stably on identical hardware for some
time.
Since then I have started seeing machines lock up. This always happens
under heavy disc load. When I bring the machine back up, it sometimes
fails to fsck due to a partially truncated inode. The lockups appear to
be disc related - on my mysql master machine it will come back up with
files somewhat shorter than those which have already been transmitted to
the slave (i.e. some data was in memory, and claimed to have been
written to the drive, but never made it onto the disc).
The only time I have seen anything useful on the screen was during one
lockup where I got a message about a spin lock being held too long and
some comment in parentheses about it being a turnstile lock.
Help! :-(
I am now downgrading all the machines to 7.0 as fast as I can - though
the machine I am trying to compile it on has locked up once during the
compile, so I haven't got anywhere so far.
The machines are HP ProLiant DL360 G5s - they have an embedded P400i
RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.
I can't tell whether my situation is related, but I am seeing lockups
on SMP Supermicro servers with both older (NetBurst-ish) and current
Xeon CPUs. I have been dropping into the kernel debugger and getting
lock information and process backtraces, but so far nothing has been
conclusively identified. I think the issue I'm seeing was introduced
sometime between October 2 and November 24 in the RELENG_7 branch, and
I suppose the next step is to do a binary search for the offending
change.
Guy
FWIW, I think I have tracked down the changes just prior to 7.1-RELEASE
that are causing my Supermicro dual Xeon machines to wedge. I did the
binary search between 2008-10-02 and 2008-11-24 without reproducing any
lockups, and then I went on to search between 2008-11-24 and
2009-01-04. An SMP kernel built from 2008-12-22 (r186409) sources was
stable for over two weeks; a kernel built from 2008-12-29 (r186590)
sources wedged in under 24 hours under moderate load.
It appears that the significant changes between r186409 and r186590 were
r186552 (delphij - reverted ATA changes) and r186535/r186534 (delphij -
reverted bce changes). My machines don't have bce interfaces, so I
suspect the ATA changes.
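
For anyone wanting to repeat the search, the bisection described above
can be sketched roughly as follows. This is only an illustration in
Python, not the procedure actually used: first_bad_revision and wedges
are hypothetical names, and wedges stands in for building a kernel from
a given revision and running it under load until it either wedges or
survives long enough to be trusted. Purely for the sake of the example
it treats r186552 as the first bad revision, which is the suspicion
stated above and has not been confirmed.

```python
def first_bad_revision(revisions, is_bad):
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions is sorted ascending, the first entry is known
    good, the last entry is known bad, and every revision after the
    first bad one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return revisions[hi]


# The endpoints (stable and wedging kernels) and the three named commits
# are taken from the message; a real search would list every commit
# in between.
candidates = [186409, 186534, 186535, 186552, 186590]


def wedges(rev):
    # Hypothetical stand-in for "build, boot, run under load".
    # ASSUMPTION for illustration only: r186552 is the first bad revision.
    return rev >= 186552


print(first_bad_revision(candidates, wedges))
```

Each test here is slow in practice (the stable kernel ran for over two
weeks before being trusted, while the bad one wedged in under a day),
so halving the candidate range on every step is what keeps the search
manageable.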
Any thoughts?
Thanks,
Guy
--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.
_______________________________________________
freebsd-stable mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable