Subject: Big problems with 7.1 locking up :-(

 yes, do ps - threads in state L or LL and RUN are especially interesting,
 trace of pids 28, 27, and threads with L on locked chan.

Here's the output of alllocks,

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

here are the pages of PS:

http://toybox.twisted.org.uk/~pete/71_lock_ps2/

(next time I boot this I will disable http to avoid getting so many)

I can't see any which are in L, LL or RUN state there though. A few RL
and WL towards the end. Traces on 28 and 27 are here:

http://toybox.twisted.org.uk/~pete/71_trace_28.png
http://toybox.twisted.org.uk/~pete/71_trace_27a.png
http://toybox.twisted.org.uk/~pete/71_trace_27b.png

I also did traces on 19 and 16 as (like 28 and 27) they are in a CPU
state, so may be of interest ?

http://toybox.twisted.org.uk/~pete/71_trace_19.png
http://toybox.twisted.org.uk/~pete/71_trace_16.png

-pete.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Chagin Dmitry

On Mon, Jan 19, 2009 at 11:39:08AM +, Pete French wrote:
  yes, do ps - threads in state L or LL and RUN are especially interesting,
  trace of pids 28, 27, and threads with L on locked chan.
 
 Here's the output of alllocks,
 
   http://toybox.twisted.org.uk/~pete/71_show_alllocks.png
 
 here are the pages of PS:
 
   http://toybox.twisted.org.uk/~pete/71_lock_ps2/
 
 (next time I boot this I will disable http to avoid getting so many)
 
 I can't see any which are in L, LL or RUN state there though. A few RL
 and WL towards the end. Traces on 28 and 27 are here:
 
   http://toybox.twisted.org.uk/~pete/71_trace_28.png
   http://toybox.twisted.org.uk/~pete/71_trace_27a.png
   http://toybox.twisted.org.uk/~pete/71_trace_27b.png
 
 I also did traces on 19 and 16 as (like 28 and 27) they are in a CPU
 state, so may be of interest ?
 
   http://toybox.twisted.org.uk/~pete/71_trace_19.png
   http://toybox.twisted.org.uk/~pete/71_trace_16.png
 

This is probably your case; please try the patch from this PR:

http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=

-- 
Have fun!
chd

Re: Big problems with 7.1 locking up :-(

 Probably it is your case, try please.

 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=

OK, will give this a try, unless anyone else wants any traces from
this locked machine ? Is there a known way to tickle this bug
when I've rebooted, to make sure it's fixed ?

thanks,

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete Carah


   Kris writes:
 You and anyone else seeing performance problems should try to work
 through the advice given here:

   [1]http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES
from configs that worked fine in the past, their systems are very slow and/or
locking up (mine are both) with the stable branch sometime (I noticed it
sometime in December, but it got worse with the release.) Most were OK in
October; mine (I think) were OK in late November - may narrow things down?
Two of my systems that lock up have no internal visibility when they do
(Soekris 4801's routing; the only time-intensive things running are routing
(done in irq context) and pflog. These run with 60+ meg ram free.) These are
complete lockups, though I did manage to get a ps out of my laptop last night
by waiting 20 _minutes_ for it to start (!). This is not a generic
performance problem. The laptop had 55 minutes of cpu time in the
softdepflush thread after being up about an hour and 10 mins; this might give
a hint. I didn't spot LL/RL state threads at the same time because I didn't
know to. Now I do. BTW - the same ps showed 8 or so user-space procs in R
state with NO cpu time; the kernel was hogging all of it for over an hour.
Firefox did indeed trigger this one as someone else noted. A soekris doing
only routing+nat has no such excuse... At least PHK was nice enough to note
the watchdog in another thread :-)

-- Pete

References

   1. http://people.freebsd.org/%7Ekris/scaling/Help_my_system_is_slow.pdf

Re: Big problems with 7.1 locking up :-(

 Probably it is your case, try please.
 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=

Well, I have been running this for a while now. I still get this:

http://toybox.twisted.org.uk/~pete/71_lor3.png

On the console, but so far the machine has not crashed. Obviously it's
only been an hour or so as yet, but given that it was freezing in
about 5 minutes earlier this morning it does look good. So thanks
for a good patch so far ;-)

-pete.

Re: Big problems with 7.1 locking up :-(

 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=

Looks like I spoke too soon - It just locked up again I am afraid.
Sitting there now at the debug prompt. It does, however, look very
different this time: For example here is 'show alllocks':

http://toybox.twisted.org.uk/~pete/71_alllocks2.png

That shows a lot of locks in UDP - is this the kind of thing you
were worried about Robert ?

When I do a 'ps' there are, this time, a number of processes in the 'LL'
and 'L' state. The images of the 'ps' and the traces of those locked
processes are to be found here:

http://toybox.twisted.org.uk/~pete/71_lock_ps3/

I tried to keep the threads which belong to each process together.

What else can I get out of this lockup ? It looks like the most
promising so far...

-pete.

Re: Big problems with 7.1 locking up :-(

 There are significant changes in UDP locking between 7.0 and 7.1, so it could
 be that we're looking at a regression there.  If you're able to reproduce this
 reliably, it might well be worth doing a little search-and-replace in
 udp_usrreq.c along the following lines:

INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT
INP_RLOCK -> INP_WLOCK
INP_RUNLOCK -> INP_WUNLOCK
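
As a rough sketch of the search-and-replace suggested above, the three substitutions could be done with sed. The sample file here is a hypothetical stand-in, not the real sys/netinet/udp_usrreq.c, and the lock calls in it are only illustrative; on a real tree the same two expressions would be run against a copy of that file and the diff reviewed before rebuilding the kernel.

```shell
#!/bin/sh
# Hypothetical sample standing in for a few lines of udp_usrreq.c;
# the real file contents are not reproduced here.
workdir=$(mktemp -d)
sample="$workdir/udp_usrreq_sample.c"
cat > "$sample" <<'EOF'
INP_RLOCK(inp);
INP_RLOCK_ASSERT(inp);
INP_RUNLOCK(inp);
INP_INFO_RLOCK(&udbinfo);
EOF

# INP_RLOCK_ASSERT contains INP_RLOCK, so the first expression covers
# both of those names; the second handles INP_RUNLOCK.  INP_INFO_RLOCK
# does not contain either pattern and is left alone.
converted=$(sed -e 's/INP_RLOCK/INP_WLOCK/g' \
                -e 's/INP_RUNLOCK/INP_WUNLOCK/g' "$sample")
printf '%s\n' "$converted"
```

Only the per-inpcb read locks are turned into write locks by this; whether that is enough to change the behaviour of the lockup is exactly what the experiment is meant to find out.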

Given that the latest lockup (see other email) has lots of locks in the UDP
code, would you like me to try this next ? The kernel which has just locked
is one using Dmitry's patch from 

http://www.freebsd.org/cgi/query-pr.cgi?pr=130652

I am not sure why that would give me different traces during the lockup
though. I was doing a lot more TCP traffic this time, but that shouldn't
interfere with UDP should it ?

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Kris Kennaway

Pete Carah wrote:

  Kris writes:

  You and anyone else seeing performance problems should try to work
  through the advice given here:

  http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

 Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES
 from configs that worked fine in the past, their systems are very slow and/or
 locking up (mine are both) with the stable branch sometime (I noticed it
 sometime in December, but it got worse with the release.) Most were OK in
 October; mine (I think) were OK in late November - may narrow things down?
 Two of my systems that lock up have no internal visibility when they do
 (Soekris 4801's routing; the only time-intensive things running are routing
 (done in irq context) and pflog. These run with 60+ meg ram free.) These are
 complete lockups, though I did manage to get a ps out of my laptop last night
 by waiting 20 _minutes_ for it to start (!). This is not a generic
 performance problem. The laptop had 55 minutes of cpu time in the
 softdepflush thread after being up about an hour and 10 mins; this might give
 a hint. I didn't spot LL/RL state threads at the same time because I didn't
 know to. Now I do. BTW - the same ps showed 8 or so user-space procs in R
 state with NO cpu time; the kernel was hogging all of it for over an hour.
 Firefox did indeed trigger this one as someone else noted. A soekris doing
 only routing+nat has no such excuse... At least PHK was nice enough to note
 the watchdog in another thread :-)

Actually, there have been several apparently different problems reported
in this thread, some of which (including the message I replied to) *are*
generic "my system is slower" problems.

For generic "my system hangs" problems, see the chapter on kernel
debugging in the handbook or follow the (same) advice given by Robert
earlier in the thread.

Kris

Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete Carah

I have done some (lots of) kernel debugging in the past.  I have several 
points:

1. I shouldn't *have* to kernel debug for a normal usage of an
official release.

2. One of the soekris boxes is 2800 MILES away, in a remote location,
with no one present who is a skilled (or, indeed, any kind of) programmer.
I usually thought I could trust a release, especially when I had been
using the stable branch updated at about monthly intervals on 3 servers
with no problems.  (actually, I waited a while on 7.0 because .0 releases
are traditionally quirky; in this case 7.0-rel worked fine and 7.1 has
problems.)  (and my servers are still running the *same* compilation of
kernel/world with no problems; the hangs are unique to the laptop
(which only started doing this badly with a Jan 9 csup) and the Soekris boxes
(which started hangs sometime in December; they clearly don't run X...))
[ I've backed my house source to -stable of 12/1/08 and hope this will help;
I don't have the time to fool around too much, and particularly to kernel
debug something that shouldn't need it.]

I can't even start X at all on this laptop now.  At least I can boot it,
but it isn't much use for work unless it can run X.

3. I can't afford the time to debug my tools (freebsd is a tool, not an
experiment, for lots of people, including me...)  I use this laptop at 
work in a place where I am *not* working on freebsd. (nor am I even allowed
to at work...)

-- Pete


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Mark Linimon

On Mon, Jan 19, 2009 at 04:59:59PM -0500, Pete Carah wrote:
 I shouldn't *have* to kernel debug for a normal usage of an
 official release.

Agreed, but the problems that people are having do not seem to have
arisen on any of the systems that ran prerelease tests for 7.1.  Although
I'm sure it does not seem that way to you, 7.1R had a very long QA cycle,
and as far as I knew all the showstopper issues had already been addressed
(although I don't officially speak for re@, I'm just an observer.)

With my bugmeister hat on, I'll happily accept suggestions about how
we can get more people involved in testing the prerelease images.  Clearly
the situation we're in right now is not where anyone wants us to be.

mcl

Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete Carah

Well, following up on my own reply earlier, I csup'd releng_7 with a 
date of last dec 1; the result works fine
in the laptop.  I'll reload the eastern soekris tonight and see how it 
does.  If the soekris is fine also then this gives a data point for 
whenever the bad commit(s) happened.
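
For anyone wanting to repeat the same rollback, a minimal sketch of a supfile that pins RELENG_7 to 1 December might look like the following. The mirror host is a placeholder and the base/prefix paths are assumed defaults, not taken from my setup; the date= line is the part that matters.

```shell
#!/bin/sh
# Write a hypothetical supfile that checks out RELENG_7 as of 2008-12-01.
supdir=$(mktemp -d)
supfile="$supdir/stable-20081201-supfile"
cat > "$supfile" <<'EOF'
*default host=cvsup.example.org
*default base=/var/db
*default prefix=/usr
*default release=cvs tag=RELENG_7
*default date=2008.12.01.00.00.00
*default delete use-rel-suffix
*default compress
src-all
EOF
cat "$supfile"

# On a real system the sources would then be fetched with something like:
#   csup "$supfile"
```

Moving the date forward a week or two at a time between 1 December and the release should narrow down when the bad commit(s) went in.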


I had apparently made the mistaken assumption that a general release 
should be better debugged than the work-in-progress leading up to it...


As I noted before - I'm in the business of *using* computers, not doing 
fbsd kernel work (I actually do linux kernel
(device driver) work in $dayjob, but so far prefer fbsd for general use, 
like routers and servers.)


I need to regen the soekris config here with the 12.01 also; if it 
doesn't hang either then I can hope that someone can
look through commit notes (I certainly don't have the time or internal 
knowledge of 7.x to do so) and try to figure out
what may have happened.  My daughter is tired of rebooting the soekris 
that is 2800 miles west of here.


One extra data point: the systems that work OK with the release code 
have Intel chipsets (older - ich3 and ich5).  The laptop is an AMD64 in 
32-bit mode with an ATI chipset and broadcom wireless (hence uses 
project evil, which has its own problems with hangs).  Soekris is Geode 
SC1100 with its own builtin chipset, presumably a mish-mash of things 
from Cyrix, National, and AMD, given the Geode series's history.  It is 
not possible to gen a system or kernel on the soekris; gcc 4.x won't run 
in 128mb of ram with no swap (they run from cf cards; swap is not possible.)
I have to cross-compile them and reload the cf cards externally if 
possible; if not I use nfs (which breaks the system
badly if it hangs during a make install; this happened this past 
weekend, fortunately not on the other coast :-(


-- Pete


Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread Kris Kennaway


Tomas Randa wrote:

 Hello,

 I have similar problems. The last good kernel I have is from the stable
 branch, October the 8th. Then in the next upgrade, I saw big problems with
 performance.

 I tried ULE, 4BSD etc, but nothing helps, only downgrading the system back.

 Now I am trying 7.1-p1 and the problems are here again. Mysql is waiting a
 lot of time with status "waiting for opening table" or "waiting for
 close tables".

 I have 32bit FreeBSD with PAE, 1x xeon 5420, supermicro motherboard,
 areca SATA controller. Could the problem be in the da device, for example?


You and anyone else seeing performance problems should try to work 
through the advice given here:


  http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

Kris


Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread Michel Talon

Tomas Randa wrote:
 Hello,
 
 I have similar problems. The last good kernel I have from stable 
 branch, october the 8. Then in next upgrade, I saw big problems with 
 performance.

I can add a "me too" here. This is on my desktop, very lightly loaded.
This computer never had a single problem under FreeBSD so i don't suspect
a hardware problem. My previous upgrade was FreeBSD 7.0-STABLE #0: Tue
Jul 22, and worked perfectly fine with exactly the same software
configuration.
Now i have FreeBSD 7.1-STABLE #0: Mon Jan  5 , and the situation is
disastrous. Freshly after boot the machine seems to work normally, but
after a few days it becomes slower and slower, windows take seconds to
appear, firefox3 begins to have garbled output, etc. Then i had the
following problem: firefox got stuck in the kernel, impossible to kill even
with kill -9. Needless to say i inspected everything, dmesg, xsession-errors,
top, etc. without seeing anything suspicious. So i rebooted, and bingo!
the machine panicked, mentioning firefox. But the panic itself got stuck
and i had to push the reset button, so no dump. After reboot, the machine
works OK for two or three days, then the problems begin again. I am
convinced there is a big problem in the kernel. For reference, here is
top and dmesg:

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 264M Active, 613M Inact, 485M Wired, 22M Cache, 112M Buf, 116M Free
Swap: 2023M Total, 4K Used, 2023M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
62965 michel      1  44    0  3532K  1884K CPU1   1   0:00  0.29% top
 2327 root        1  44    0   161M 29228K select 1  30:39  0.00% Xorg
95937 root        1  44    0 24112K 16800K select 1   2:35  0.00% kdm-bin_gr
 3099 root        1   4    0  3304K  1028K select 0   1:30  0.00% moused
 2209 news        1   8    0  3464K  1052K wait   0   0:37  0.00% sh
  884 root        1  44    0  4712K  2028K select 1   0:12  0.00% ntpd
  453 _pflogd     1 -58    0  3380K  1352K bpf    0   0:11  0.00% pflogd
 1634 www         1   4    0  6268K  2656K kqread 0   0:10  0.00% lighttpd
  788 root        1  44    0  3164K  3184K select 0   0:04  0.00% amd
 2206 news        1  44    0 15208K 12160K select 0   0:03  0.00% innd
  879 root        9   4    0  5432K  2460K kqread 1   0:02  0.00% nscd
  955 root        1  44    0  2736K  1216K select 1   0:02  0.00% master
  758 root        1  44    0  3164K  1340K select 1   0:02  0.00% ypbind
...

so no memory problem

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-STABLE #0: Mon Jan  5 14:29:23 CET 2009
mic...@niobe.lpthe.jussieu.fr:/usr/obj/usr/src/sys/NIOBE
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.06GHz (3073.65-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0xf27  Stepping = 7
  
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
  Logical CPUs per core: 2
real memory  = 1610530816 (1535 MB)
avail memory = 1568387072 (1495 MB)
ACPI APIC Table: ASUS   P4PE
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
This module (opensolaris) contains code covered by the
Common Development and Distribution License (CDDL)
see http://opensolaris.org/os/licensing/opensolaris_license/
ioapic0 Version 2.0 irqs 0-23 on motherboard
acpi0: ASUS P4PE on motherboard
acpi0: Overriding SCI Interrupt from IRQ 9 to IRQ 22
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a (3) failed
acpi0: reservation of 10, 5ff0 (3) failed
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0xe408-0xe40b on acpi0
acpi_button0: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
agp0: Intel 82845G host to AGP bridge on hostb0
pcib1: ACPI PCI-PCI bridge at device 1.0 on pci0
pci1: ACPI PCI bus on pcib1
vgapci0: VGA-compatible display port 0xd800-0xd8ff mem 0xe000-0xefff,0xdf00-0xdf00 irq 16 at device 0.0 on pci1
uhci0: Intel 82801DB (ICH4) USB controller USB-A port 0xb800-0xb81f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: Intel 82801DB (ICH4) USB controller USB-A on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 on usb0
uhub0: 2 ports with 2 removable, self powered
uhci1: Intel 82801DB (ICH4) USB controller USB-B port 0xb400-0xb41f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
uhci1: [ITHREAD]
usb1: Intel 82801DB (ICH4) USB controller USB-B on uhci1
usb1: USB revision 1.0
uhub1: Intel

Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread dick hoogendijk

On Sun, 18 Jan 2009 13:21:17 +0100
Michel Talon ta...@lpthe.jussieu.fr wrote:
 My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked
 perfectly fine with exactly the same software configuration. 
 Now i have FreeBSD 7.1-STABLE #0: Mon Jan5, and the situation is
 disastrous.

Makes you wonder what on earth could have changed that much between
7.0/7.1. Nice upgrade.. This should not happen on the same hardware!

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS sxce snv105 ++
+ All that's really worth doing is what we do for others (Lewis Carrol)

Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread Claus Guttesen

 My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked
 perfectly fine with exactly the same software configuration.
 Now i have FreeBSD 7.1-STABLE #0: Mon Jan5, and the situation is
 disastrous.

 Makes you wonder what on earth could have changed that much between
 7.0/7.1. Nice upgrade.. This should not happen on the same hardware!

There will always be changes when new features/options/enhancements
are introduced. I for my part have never had any serious trouble with
FreeBSD whatsoever since FreeBSD 5.1/2 when some kernel-limits had
to be changed. My problem was solved with the help from this list.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(

 If you are able to get into the debugger, the normal commands would be most 
 helpful, especially if you can log the results:

It finally locked up, and ctrl-alt-esc got me into the debugger at
last! is there anything else you want me to get whilst it is
like that aside from:

ps
show lockedvnods
show alllocks

which I can go and capture as screenshots. I can probably sort out console
access to it if that would be useful whilst it is in this
state ?

-pete.

Re: Big problems with 7.1 locking up :-(

ps

output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/
there are a lot of processes as this machine runs the same webservices
as the actual webservers, just that nobody connects to them.

show lockedvnods

nothing - there are no locked vnodes

show alllocks

this gives me 'no such command'. There's a whole list of things I
can show, but none of them look like all the locks. What about the locktree
or the lockchain ?

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Chagin Dmitry

On Fri, Jan 16, 2009 at 12:35:49PM +, Pete French wrote:
 ps
 
 output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/
 there are a lot of processes as this machine runs the same webservices
 as the actual webservers, just that nobody connects to them.
 
 show lockedvnods
 
 nothing - there are no locked vnodes
 
 show alllocks
 
 this gives me 'no such command'. There's a whole list of things I
 can show, but none of them look like all the locks. What about the locktree
 or the lockchain ?
 

hi, please type:
show lock 0xff0001254d20
and then show thread 0xXXX where X is 'owner' of previous output.


-- 
Have fun!
chd

Re: Big problems with 7.1 locking up :-(

 hi, please type:
 show lock 0xff0001254d20
 and then show thread 0xXXX where X is 'owner' of previous output.

http://toybox.twisted.org.uk/~pete/71_pdns_lock.png

That's in Power DNS - which is interesting because the one difference
between the boxes that lock and those which don't is that the locking
ones are serving DNS.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Chagin Dmitry

On Fri, Jan 16, 2009 at 01:34:14PM +, Pete French wrote:
  hi, please type:
  show lock 0xff0001254d20
  and then show thread 0xXXX where X is 'owner' of previous 
  output.
 
 http://toybox.twisted.org.uk/~pete/71_pdns_lock.png
 
 That's in Power DNS - which is interesting because the one difference
 between the boxes that lock and those which don't is that the locking
 ones are serving DNS.
 

trace 832

-- 
Have fun!
chd

Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Robert Watson



On Fri, 16 Jan 2009, Pete French wrote:

hi, please type: show lock 0xff0001254d20 and then show thread 
0xXXX where X is 'owner' of previous output.


http://toybox.twisted.org.uk/~pete/71_pdns_lock.png

That's in Power DNS - which is interesting because the one difference 
between the boxes that lock and those which don't is that the locking ones 
are serving DNS.


I rather feared as much.  Let's run down the path of "perhaps there's a 
problem with the new UDP locking code" for a bit and see where it takes us. 
Is it possible to run those boxes with WITNESS -- I believe that the fact that 
"show alllocks" is failing is because WITNESS isn't present.  The other thing 
we can do is revert UDP to using purely write locks -- the risk there is that 
it might change the timing but not actually resolve the bug, so if we can 
analyze it a bit using WITNESS first that would be useful.


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

 trace 832

http://toybox.twisted.org.uk/~pete/71_trace_832_1.png
http://toybox.twisted.org.uk/~pete/71_trace_832_2.png

-pete.


Re: Big problems with 7.1 locking up :-(

 I rather feared as much.  Let's run down the path of "perhaps there's a 
 problem with the new UDP locking code" for a bit and see where it takes us. 
 Is it possible to run those boxes with WITNESS -- I believe that the fact that
 "show alllocks" is failing is because WITNESS isn't present.

Yes, I can do that. The only reason I wasn't running with WITNESS is that
it didn't lock up when I added the BREAK_TO_DEBUGGER so I was seeing if
a simple GENERIC kernel would lock up when I added that. I will go
back and add WITNESS when you tell me there's nothing more
we can get out of this lock up (recompiling will involve restarting the
machine so I lose the 'broken to debugger' state). Should I add
anything else ? Skip spinlocks ? Invariants ?

 The other thing we can do is revert UDP to using purely write locks -- the
 risk there is that it might change the timing but not actually resolve the
 bug, so if we can analyze it a bit using WITNESS first that would be useful.

Yes, I will run with WITNESS and anything else you might want. Is there
anything else you, or anyone else, wants from this kernel ? It may take
another day to lock up when I've restarted it unfortunately.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Robert Watson



On Fri, 16 Jan 2009, Pete French wrote:

I rather feared as much.  Let's run down the path of "perhaps there's a 
problem with the new UDP locking code" for a bit and see where it takes us. 
Is it possible to run those boxes with WITNESS -- I believe that the fact 
that "show alllocks" is failing is because WITNESS isn't present.


Yes, I can do that. The only reason I wasn't running with WITNESS is that it 
didn't lock up when I added the BREAK_TO_DEBUGGER so I was seeing if a 
simple GENERIC kernel would lock up when I added that. I will go back and 
add WITNESS when you tell me there's nothing more we can get out of this lock 
up (recompiling will involve restarting the machine so I lose the 'broken to 
debugger' state). Should I add anything else ? Skip spinlocks ? Invariants ?


The other thing we can do is revert UDP to using purely write locks -- the 
risk there is that it might change the timing but not actually resolve the 
bug, so if we can analyze it a bit using WITNESS first that would be 
useful.


Yes, I will run with WITNESS and anything else you might want. Is there 
anything else you, or anyone else, wants from this kernel ? It may take 
another day to lock up when I've restarted it unfortunately.


If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good. 
WITNESS does a number of things, including tracking (and being judgemental 
about) lock order.  One nice side effect of that tracking is that we keep 
track of a lot more lock state explicitly, so DDB's "show alllocks", "show 
locks", etc, commands can build on that.  "show lockedvnods" works without 
WITNESS, though, so your results so far suggest this is likely not related to 
vnode locking.
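
For reference, a minimal sketch of a kernel config with those options might be generated as below. The config name DEBUG and the use of "include GENERIC" are assumptions for illustration; KDB, DDB and BREAK_TO_DEBUGGER are listed because they are needed to reach the debugger prompt at all, and some of these may already be set in the config being included.

```shell
#!/bin/sh
# Write a hypothetical DEBUG kernel config into a scratch directory.
conf_dir=$(mktemp -d)
kernconf="$conf_dir/DEBUG"
cat > "$kernconf" <<'EOF'
include GENERIC
ident DEBUG
options KDB
options DDB
options BREAK_TO_DEBUGGER
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
EOF
cat "$kernconf"
```

On a real system the file would live under /usr/src/sys/<arch>/conf/ and be built from /usr/src with "make buildkernel KERNCONF=DEBUG" followed by "make installkernel KERNCONF=DEBUG". Expect WITNESS to slow the machine down noticeably, which may itself shift the timing of the hang.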


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

 If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good. 
 WITNESS does a number of things, including tracking (and being judgemental 
 about) lock order.  One nice side effect of that tracking is that we keep 
 track of a lot more lock state explicitly, so DDB's "show alllocks", "show 
 locks", etc, commands can build on that.  "show lockedvnods" works without 
 WITNESS, though, so your results so far suggest this is likely not related to 
 vnode locking.

Right, I've gone back to my DEBUG kernel which has a lot of options in it,
including all the above. It has locked almost immediately luckily, so
now I have it sitting at the debugger prompt. The output from 'show alllocks'
is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

Which of these are worth tracing ?

-pete.

Re: Big problems with 7.1 locking up :-(

Just continuing to look at this with the help of Dmitry, and the
output from 'bt' is here:

http://toybox.twisted.org.uk/~pete/71_bt.png

The top bit of that is from my 'show alllocks', the full version
of which is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Pete French

Just an update on this - I tried the various kernels, but now the machine is
not locking up at all. As I haven't actually changed anything, this does
not make me as happy as you might expect. I don't know what to do now - I 
dare not upgrade the machines to an OS that I know locks, but if I can't
make it lock then it is impossible to get any useful debugging info out
of it.

maybe waiting for 7.2 is the best move...

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Robert Watson



On Thu, 15 Jan 2009, Pete French wrote:

 Just an update on this - I tried the various kernels, but now the machine is
 not locking up at all. As I haven't actually changed anything, this does
 not make me as happy as you might expect. I don't know what to do now - I
 dare not upgrade the machines to an OS that I know locks, but if I can't
 make it lock then it is impossible to get any useful debugging info out of
 it. Maybe waiting for 7.2 is the best move...


Well, one slightly pessimistic (or realistic) view says that all software 
contains bugs, it's just a question of whether or not your workload and 
environment trigger those bugs in a noticeable way.


Given the inconsistency of the symptoms, I wouldn't preclude something 
environmental: could it be that it was the bottom, or more likely, top box in 
a rack and that your air conditioning isn't quite as effective there when the 
outside temperature is above/below some threshold?  Alternatively, could it be 
that the workload changed very slightly -- you're doing fewer DNS queries, or 
the network latency to the DNS server changed?


Certainly, whoever gave the advice on checking BIOS revisions is right: you 
can spend a lot of time tracking down a bug to realize that one box has a 
slightly different BIOS rev and therefore does/doesn't suffer from an obscure 
SMI bug.


In any case, if it starts to recur reproducibly, send out mail and we can see
if we can track it down some more.  BTW, did you establish if the version of
iLO you have has a remote NMI?  I seem to recall that some do, and being able
to deliver an NMI is really quite valuable.


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Pete French

 Given the inconsistency of the symptoms, I wouldn't preclude something 
 environmental: could it be that it was the bottom, or more likely, top box in 
 a rack and that your air conditioning isn't quite as effective there when the 
 outside temperature is above/below some threshold?

It's a possibility - but the two machines which were exhibiting the fault
are in Slough and Baton Rouge respectively, so under very different climatic
conditions. However, something has changed to make it stop locking up!
The USA one was doing it every couple of hours at the start of the week, and
the UK one wouldn't last more than half an hour at one point.

 Alternatively, could it be that the workload changed very slightly -- you're
 doing fewer DNS queries, or the network latency to the DNS server changed?

Also a possibility - that workload is entirely dependent on customer behaviour
which is an unpredictable beast!

 Certainly, whoever gave the advice on checking BIOS revisions is right: you 
 can spend a lot of time tracking down a bug to realize that one box has a 
 slightly different BIOS rev and therefore does/doesn't suffer from an obscure 
 SMI bug.

Yes, that's next on my list - make sure they are all on the same version.

 In any case, if it starts to recur reproducibly, send out mail and we can see
 if we can track it down some more.  BTW, did you establish if the version of
 iLO you have has a remote NMI?  I seem to recall that some do, and being able
 to deliver an NMI is really quite valuable.

OK, thanks. My iLO2 appears to have the ability to generate an NMI on
demand, so that could be used if/when the fault crops up again.

Thanks, will let this lie for now and resurrect the thread when I can
get some more useful data.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Robert Watson



On Thu, 15 Jan 2009, Pete French wrote:

  In any case, if it starts to recur reproducibly, send out mail and we can
  see if we can track it down some more.  BTW, did you establish if the
  version of iLO you have has a remote NMI?  I seem to recall that some do,
  and being able to deliver an NMI is really quite valuable.

 OK, thanks. My iLO2 appears to have the ability to generate an NMI on
 demand, so that could be used if/when the fault crops up again.

 Thanks, will let this lie for now and resurrect the thread when I can get
 some more useful data.


Excellent WRT NMI.  As long as you have DDB, KDB, and BREAK_TO_DEBUGGER 
compiled into the kernel, generating that should reliably get you into the 
debugger.  If it's possible to keep running with INVARIANTS and WITNESS, or 
just INVARIANTS if WITNESS slows things down too much, that would be 
desirable.  You might want to give the NMI a test run just to make sure it 
behaves as you think it should, though -- be aware that if DDB/KDB aren't 
compiled into the kernel, then an NMI will panic the box.
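
For reference, the options named above could be collected in one kernel config, sketched here as a small shell helper. This is only a sketch: the config name DEBUG71 and the helper name are made up for illustration, and `include GENERIC`, INVARIANT_SUPPORT (which INVARIANTS depends on) and WITNESS_SKIPSPIN are assumptions about what such a config would usually carry, not a description of the config actually used on the affected machines.

```shell
#!/bin/sh
# Sketch only: write a debug kernel config with the options discussed in
# this thread. DEBUG71 and write_debug_kernconf are hypothetical names.
write_debug_kernconf() {
    conf_dir="$1"    # e.g. /usr/src/sys/amd64/conf (assumed source layout)
    cat > "${conf_dir}/DEBUG71" <<'EOF'
include GENERIC
ident   DEBUG71

options KDB                 # kernel debugger framework
options DDB                 # interactive debugger backend
options BREAK_TO_DEBUGGER   # a console break enters the debugger
options INVARIANTS          # extra run-time consistency checks
options INVARIANT_SUPPORT   # required by INVARIANTS
options WITNESS             # lock order tracking
options WITNESS_SKIPSPIN    # skip spin locks to reduce WITNESS overhead
EOF
}

# Example use (not run here):
#   write_debug_kernconf /usr/src/sys/amd64/conf
#   cd /usr/src && make buildkernel KERNCONF=DEBUG71
```

If WITNESS turns out to be too slow, dropping the two WITNESS lines leaves the INVARIANTS-only variant mentioned above.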


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Pete French

 desirable.  You might want to give the NMI a test run just to make sure it 
 behaves as you think it should, though -- be aware that if DDB/KDB aren't 
 compiled into the kernel, then an NMI will panic the box.

Unfortunately it does this...

http://toybox.twisted.org.uk/~pete/71_nmi1.png

That is locked up too - hitting return does nothing. I was hoping it
was just garbled output but had actually gone to the debugger.
Apparently not.

That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER,
which does work as I have tested it with CTRL_ALT_ESC.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Robert Watson


On Thu, 15 Jan 2009, Pete French wrote:

  desirable.  You might want to give the NMI a test run just to make sure it
  behaves as you think it should, though -- be aware that if DDB/KDB aren't
  compiled into the kernel, then an NMI will panic the box.

 Unfortunately it does this...

 http://toybox.twisted.org.uk/~pete/71_nmi1.png

 That is locked up too - hitting return does nothing. I was hoping it was
 just garbled output but had actually gone to the debugger. Apparently not.

 That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which
 does work as I have tested it with CTRL_ALT_ESC.


Er, that's rather upsetting.  John, do you have any ideas about this?

Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread John Baldwin

On Thursday 15 January 2009 12:49:11 pm Robert Watson wrote:
 On Thu, 15 Jan 2009, Pete French wrote:
 
  desirable.  You might want to give the NMI a test run just to make sure it
  behaves as you think it should, though -- be aware that if DDB/KDB aren't
  compiled into the kernel, then an NMI will panic the box.
 
  Unfortunately it does this...
 
  http://toybox.twisted.org.uk/~pete/71_nmi1.png
 
  That is locked up too - hitting return does nothing. I was hoping it was 
  just garbled output but had actually gone to the debugger. Apparently not.
 
  That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which 
  does work as I have tested it with CTRL_ALT_ESC.
 
 Er, that's rather upsetting.  John, do you have any ideas about this?

I still have no context on the rest of the thread.  The garbage is due to 
competing panics I think.  The problem is we don't single thread the printf's 
in 'trap_fatal()'.  We should probably have some sort of simple spin lock 
thing in the x86 code to only allow 1 CPU at a time to run through that 
routine.

-- 
John Baldwin

Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Pete French

 If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing 
 ctrl-alt-break on the console to see if you can drop into the debugger, or 
 issue a serial break on a serial console.

Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained
all the other stuff (WITNESS etc...). The end result...

...it no longer crashes :-(

I am not sure what to make of that! What could adding this to the kernel
possibly do which would make my problems go away ? Should I try just
adding this option to my GENERIC kernel and seeing if that also gives me
something stable ?

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Robert Watson


On Wed, 14 Jan 2009, Pete French wrote:

  If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing
  ctrl-alt-break on the console to see if you can drop into the debugger, or
  issue a serial break on a serial console.

 Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained
 all the other stuff (WITNESS etc...). The end result...

 ...it no longer crashes :-(

 I am not sure what to make of that! What could adding this to the kernel
 possibly do which would make my problems go away ? Should I try just adding
 this option to my GENERIC kernel and seeing if that also gives me something
 stable ?


Yeah, that is unexpected -- the BREAK_TO_DEBUGGER path should have almost no 
effect on control flow, unlike, say, WITNESS, which significantly distorts 
timing.  Is there any chance you picked up any of the recent fixes that went 
into RELENG_7 without noticing, and that perhaps one of those did it?  With 
regard to what to do: if you didn't pick up a fix without noticing, yeah, I 
think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at 
least, masked) the problem.  Generally with this sort of testing one has to be 
pretty rigorous in testing assumptions, because it's easy for changes to sneak 
in.  Particularly annoying are seemingly innocuous code changes that do things 
like slightly rearrange kernel memory.


FWIW, I suspect the various reports we are seeing reflect more than one 
problem, and that they must be relatively edge-case individually but reports 
of a few problems have led to more coming out of the woodwork.  Obviously, 
the problems are not edge-case to the people experiencing them...


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Pete French

 effect on control flow, unlike, say, WITNESS, which significantly distorts 
 timing.  Is there any chance you picked up any of the recent fixes that went 
 into RELENG_7 without noticing, and that perhaps one of those did it?  With 

I'm pretty certain I haven't - I have just been changing kernel config
files, I haven't actually csup'd at all.

 regard to what to do: if you didn't pick up a fix without noticing, yeah, I 
 think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at 
 least, masked) the problem.

OK. I think I need at least 4 kernels to try here: GENERIC (which should
show the problem), my original DEBUG (which also shows the problem) plus
both of those with BREAK_TO_DEBUGGER included to see if that fixes it. Can
I just add BREAK_TO_DEBUGGER on its own to a config file ? I was wondering
if I need to include one of the other debugger options so that it has
something to break to ?
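
One way to keep that four-kernel comparison honest is to derive the two extra configs mechanically, so that each differs from its base by exactly one option. A rough sketch follows; the `-BTD` suffix and the helper name are invented here, it assumes config(8)'s `include` directive, and it assumes config files named after each base already exist in the target directory. It deliberately leaves open the question asked above, namely whether BREAK_TO_DEBUGGER needs KDB/DDB alongside it to be useful.

```shell
#!/bin/sh
# Sketch only: create "<BASE>-BTD" kernel configs that add a single option
# on top of an existing base config. Names are hypothetical.
make_btd_variants() {
    conf_dir="$1"; shift
    for base in "$@"; do
        {
            echo "include ${base}"
            echo "ident   ${base}-BTD"
            echo "options BREAK_TO_DEBUGGER"
        } > "${conf_dir}/${base}-BTD"
    done
}

# Example use (not run here):
#   make_btd_variants /usr/src/sys/amd64/conf GENERIC DEBUG
#   cd /usr/src && make buildkernel KERNCONF=GENERIC-BTD
```

Building all four from the same unchanged source tree keeps the option as the only variable between each pair.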

 FWIW, I suspect the various reports we are seeing reflect more than one 
 problem, and that they must be relatively edge-case individually but reports 
 of a few problems have led to more coming out of the woodwork.  Obviously, 
 the problems are not edge-case to the people experiencing them...

I was thinking that too - I've been guilty of this in the past, lumping
my problem in with others under the assumption that it's all the same. This
is obviously pretty rare - out of 24 of the HP servers the problem only crops
up on 4 of them. But there is nothing different about those 4.

I will let you know what my various kernel compiles give me - am building
again from scratch, which is slow with WITNESS enabled.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Claus Guttesen

 my problem in with others under the assumption that it's all the same. This
 is obviously pretty rare - out of 24 of the HP servers the problem only crops
 up on 4 of them. But there is nothing different about those 4.

Could it be different bios/firmware on the hp-servers?

Mr. Aliyev was unable to install 7.1 release on amd64 on a DL380 G5.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Doug Barton

Pete French wrote:
 Mine never lock up doing buildworlds either. They only lock up when they are
 sitting there more or less idle! The machines which have never locked up
 are the webservers, which are fairly heavily loaded. The machine which locks
 up the most frequently is a box sitting there doing nothing but DNS, which is
 the most lightly loaded of the lot.

Silly question but do you have powerd enabled on that server? If so,
does disabling it help? Also do you have any of these in /etc/rc.conf
(i.e., they are not the same as the default values in
/etc/defaults/rc.conf):
performance_cx_lowest="HIGH"    # Online CPU idle state
performance_cpu_freq="NONE"     # Online CPU frequency
economy_cx_lowest="HIGH"        # Offline CPU idle state
economy_cpu_freq="NONE"         # Offline CPU frequency
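
A quick way to answer that on a given box is to grep the rc.conf for the relevant variables. The helper below is only a sketch with an invented name; it looks at powerd_enable and the four variables listed above and reports when none of them is set.

```shell
#!/bin/sh
# Sketch only: list power-management settings in an rc.conf-style file.
# show_power_settings is a hypothetical helper name.
show_power_settings() {
    rc_file="${1:-/etc/rc.conf}"
    grep -E '^(powerd_enable|(performance|economy)_(cx_lowest|cpu_freq))=' "$rc_file" \
        || echo "no power-management settings in ${rc_file}"
}

# Example use (not run here):
#   show_power_settings /etc/rc.conf
```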


Doug

-- 

This .signature sanitized for your protection


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Claus Guttesen

 Mine never lock up doing buildworlds either. They only lock up when they are
 sitting there more or less idle! The machines which have never locked up
 are the webservers, which are fairly heavily loaded. The machine which locks
 up the most frequently is a box sitting there doing nothing but DNS, which is
 the most lightly loaded of the lot.

The server has been idle for a day now and is up and running. I have
then copied a file to generate some i/o and it copies without
problems.

# make ten copies of a large file to generate disk i/o
for ((a=0;a<10;a++))
do
  cp netbeans-6.5-ml-macosx.dmg ${a}.dmg
done

I can't  (fortunately) make it lock up. I have a DL360 G5 which is
unused atm. and can test on it if needed.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Gavin Atkinson

On Mon, 2009-01-12 at 19:00 +, Pete French wrote:
  I'm not sure if you've done this already, but the normal suggestions apply:
  have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
  any results / panics / etc result?  Sometimes these debugging tools are able
  to convert hangs into panics, which gives us much more ability to debug
  them.
 
 OK, I have now had a machine hang again, with the correct debug options in
 the kernel. The screen looked like this when I went to restart it:

   http://toybox.twisted.org.uk/~pete/71_lor2.png

 It had not, however, dropped into any kind of debugger. Also there appear
 to be console messages after the lock order reversal - is that normal ?

 The machine did stay up for a significant amount of time before doing this. I
 notice that it is more or less identical to the one I posted when I
 had WITNESS_KDB in the kernel too, so maybe those results aren't
 entirely spurious after all ?

 Given it hasn't dropped to a debugger, is there anything else I can try ?

Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break
over the serial line?

Gavin

Re: Big problems with 7.1 locking up :-(

 Lock order reversals are warnings of potential deadlock due to a lock cycle, 
 but deadlocks may not actually result, either because it's a false positive 
 (some locking construct that is deadlock free but involves lock cycles), or 
 because a cycle didn't actually form.  The message is suggestive, but if you 
 have significant system activity after the message, then it may be unrelated.

It's hard to tell in this case as there are no timestamps, so I can't
see if there is any activity after the lockup.

 Features like WITNESS and INVARIANTS may change the timing of the kernel 
 making certain race conditions less likely; I'd run with them for a bit and 
 see if you can reproduce the hang with them present, as they will make 
 debugging the problem a lot easier, if it's possible.

Uh, the above *was* me reproducing the hang with them present ;-)) It
quite happily hangs with those things in the kernel - indeed the next
hang was immediately after I rebooted the machine. But even with WITNESS
and INVARIANTS and all the rest it does not drop to a debugger, it
simply locks up.

That machine is currently turned off, but still has 7.1 installed. What
would you like me to try now ? I have a lockup I can reproduce pretty
reliably now (just wait and it will always lock up). I also found that
my other 7.1 box locks up fairly reliably when doing a buildworld.

The only difference between these two machines and the ones which don't
lock up is that these are serving DNS. The others don't. Note that all
the hardware is identical, as is the installed software and the configuration.

I am at a total loss...

-pete.

Re: Big problems with 7.1 locking up :-(

 It was mentioned previously in this thread that CPUTYPE could be an
 issue. Did you change this if you customized your kernel?

Actually, I think that's been ruled out as a possible cause, along
with the scheduler. Certainly I have tried it both ways and
there is no difference, and I think I saw that the others had too.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Robert Watson



On Tue, 13 Jan 2009, Pete French wrote:

  Features like WITNESS and INVARIANTS may change the timing of the kernel
  making certain race conditions less likely; I'd run with them for a bit and
  see if you can reproduce the hang with them present, as they will make
  debugging the problem a lot easier, if it's possible.

 Uh, the above *was* me reproducing the hang with them present ;-)) It quite
 happily hangs with those things in the kernel - indeed the next hang was
 immediately after I rebooted the machine. But even with WITNESS and
 INVARIANTS and all the rest it does not drop to a debugger, it simply locks
 up.

 That machine is currently turned off, but still has 7.1 installed. What
 would you like me to try now ? I have a lockup I can reproduce pretty
 reliably now (just wait and it will always lock up). I also found that my
 other 7.1 box locks up fairly reliably when doing a buildworld.

 The only difference between these two machines and the ones which don't lock
 up is that these are serving DNS. The others don't. Note that all the
 hardware is identical, as is the installed software and the configuration.


If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing 
ctrl-alt-break on the console to see if you can drop into the debugger, or 
issue a serial break on a serial console.  For reasons that are somewhat
complicated to explain, serial breaks are more effective at getting into the
debugger, so are preferable -- also because you can more easily log output
from the debugger.


If you are able to get into the debugger, the normal commands would be most 
helpful, especially if you can log the results:


  ps
  show lockedvnods
  show alllocks

Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

 Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break
 over the serial line?

No, ctrl-alt-esc doesn't work, and there is no serial line on the machine (not
that I can access anyway)

-pete.

Re: Big problems with 7.1 locking up :-(

 Silly question but do you have powerd enabled on that server? If so,
 does disabling it help? Also do you have any of these in /etc/rc.conf
 (i.e., they are not the same as the default values in
 /etc/defaults/rc.conf):
 performance_cx_lowest="HIGH"    # Online CPU idle state
 performance_cpu_freq="NONE"     # Online CPU frequency
 economy_cx_lowest="HIGH"        # Offline CPU idle state
 economy_cpu_freq="NONE"         # Offline CPU frequency

No, none of those. My rc.conf is below. The only slightly unusual thing I
am doing is using lagg rather than the interfaces directly I guess, but
that has worked fine for ages.

-pete.


hostname="florentine.rattatosk"
cloned_interfaces="lagg0"
network_interfaces="lo0 bce0 bce1 lagg0"
ifconfig_bce0="up"
ifconfig_bce1="up"
ifconfig_lagg0="laggproto lacp laggport bce0 laggport bce1"

ipv4_addrs_lagg0="10.48.19.0/16 10.48.19.229/16 10.48.19.223/16 10.48.19.243/16 10.48.19.226/16 10.48.19.224/16 10.48.19.227/16 10.48.19.239/16 10.48.19.225/16 10.48.19.230/16 10.48.19.232/16 10.48.19.228/16 10.48.19.235/16 10.48.19.244/16 10.48.19.245/16"

defaultrouter="10.48.0.9"

inetd_enable="YES"
sshd_enable="YES"

dhcpd_enable="YES"
dhcpd_ifaces="lagg0"
dhcpd_flags="-q"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_withumask="022"

nfs_client_enable="YES"
nfs_server_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"

named_enable="YES"
pdns_enable="YES"
pdns_recursor_enable="NO"

mysql_enable="YES"

apache22_http_accept_enable="YES"
apache22_enable="YES"

ntpd_enable="YES"
ntpd_sync_on_start="YES"

exim_enable="YES"
exim_flags="-bd -q10m"
sendmail_enable="NONE"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

Re: Big problems with 7.1 locking up :-(

 I can't  (fortunately) make it lock up. I have a DL360 G5 which is
 unused atm. and can test on it if needed.

Would it be possible to install that under amd64 and hammer it with
DNS requests ? I have been trying to think what the difference might be
between my webservers and the machines which are freezing, and the only
one I can come up with is UDP traffic as the locking machines are serving
DNS and also NFS.
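
Something along these lines might do for the hammering, run from another machine against the test box. It is only a sketch: the function name, the server address and the query name in the example are placeholders, and it assumes dig(1) is available (the lookup command can be swapped via DNS_QUERY).

```shell
#!/bin/sh
# Sketch only: send a burst of DNS queries at one server.
# DNS_QUERY is the lookup command; dig(1) is assumed to be installed.
DNS_QUERY="${DNS_QUERY:-dig}"

dns_hammer() {
    server="$1"; name="$2"; count="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        # one short-lived query per iteration; failed lookups are ignored
        "$DNS_QUERY" "@${server}" "$name" A +tries=1 +time=1 >/dev/null 2>&1 || :
        i=$((i + 1))
    done
    echo "sent ${i} queries to ${server}"
}

# Example use (not run here; address and name are placeholders):
#   dns_hammer 10.0.0.53 example.org 100000
```

Running several copies in parallel would get closer to real client behaviour than one sequential loop.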

-pete.

RE: Big problems with 7.1 locking up :-(

2009-01-13 Thread Nathan Way

I also am experiencing lock-ups on a server recently upgraded from
7.0-RELEASE to 7.1-STABLE.  This server is a Supermicro 6022 dual-Xeon
box running a GENERIC i386 SMP kernel.  Since upgrading to 7.1-STABLE it
has started locking up daily.  I see symptoms similar to those Pete is
seeing - no ping response, no keyboard response, no video output on a
very lightly loaded server.

I have a test machine with duplicate hardware to the one locking up that
I just finished installing 7.1-STABLE on but so far it hasn't locked up.
Coincidentally my locking machine is also a DNS server but I have not
enabled DNS on my test machine yet.

Since the locking server is remote to me, I need to downgrade it to 7.0
to get it stable again.  Once I finish that process, I can provide
remote access to the 7.1-STABLE machine in my office if anyone would
like to test with it.


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Robert Watson



On Tue, 13 Jan 2009, Pete French wrote:

  I can't (fortunately) make it lock up. I have a DL360 G5 which is unused
  atm. and can test on it if needed.

 Would it be possible to install that under amd64 and hammer it with DNS
 requests ? I have been trying to think what the difference might be between
 my webservers and the machines which are freezing, and the only one I can
 come up with is UDP traffic as the locking machines are serving DNS and also
 NFS.


There are significant changes in UDP locking between 7.0 and 7.1, so it could 
be that we're looking at a regression there.  If you're able to reproduce this 
reliably, it might well be worth doing a little search-and-replace in 
udp_usrreq.c along the following lines:


  INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT
  INP_RLOCK -> INP_WLOCK
  INP_RUNLOCK -> INP_WUNLOCK

However, before making these changes for debugging purposes, make sure it's 
100% reproducible without them in the configuration so that we don't find 
ourselves barking up the wrong tree.  Normally deadlocks along these lines 
*do* allow breaking into the debugger from a serial console, but since there 
are significant changes here in 7.1 it is worth trying to see if this might be 
related.
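
The search-and-replace could be done with sed, roughly as sketched below. The helper name is invented and the path in the example is an assumption about where udp_usrreq.c lives in the source tree; the function keeps a `.orig` copy so the change is easy to back out once the experiment is over.

```shell
#!/bin/sh
# Sketch only: turn inpcb read locks into write locks in one source file.
# rlock_to_wlock is a hypothetical helper name.
rlock_to_wlock() {
    src="$1"
    cp "$src" "${src}.orig"    # keep the unmodified file for backing out
    # The INP_RLOCK pattern also rewrites INP_RLOCK_ASSERT, since it is a
    # prefix of it. INP_INFO_RLOCK is left alone because the pattern needs
    # "INP_" directly in front of "RLOCK".
    sed -e 's/INP_RLOCK/INP_WLOCK/g' \
        -e 's/INP_RUNLOCK/INP_WUNLOCK/g' \
        "${src}.orig" > "$src"
}

# Example use (not run here; path assumed):
#   rlock_to_wlock /usr/src/sys/netinet/udp_usrreq.c
```

A `diff -u` of the `.orig` file against the result is worth a look before rebuilding the kernel, to check that only the intended macros changed.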


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Ken Smith

On Mon, 2009-01-12 at 21:35 +0100, Tomas Randa wrote:
 I have similar problems. The last good kernel I have is from the stable
 branch, October the 8th. Then in the next upgrade, I saw big problems with
 performance.
 I tried ULE, 4BSD etc, but nothing helps, only downgrading the system back.

 Now I am trying 7.1-p1 and the problems are here again. Mysql is waiting a
 lot of time with status waiting for opening table or waiting for
 close tables

 I have 32bit FreeBSD with PAE, 1x xeon 5420, supermicro motherboard,
 areca SATA controller. Could the problem be in the da device, for example?
 
 Thanks Tomas Randa

Could you give r186860 a try?  It is an MFC into stable/7 so if the
machine in question is something you can experiment with just updating
to stable/7 would take care of it.  Otherwise if you could just manually
apply the patch to a 7.1 source tree and do a test build of the kernel
that would also do it.

I'm not experiencing lockups but this patch helped a lot on a machine I
have with a particular disk I/O pattern that resulted in extremely poor
performance with 7.1-RELEASE.  This patch brought it back to its normal
performance level.

Thanks.

-- 
Ken Smith
- From there to here, from here to  |   kensm...@cse.buffalo.edu
  there, funny things are everywhere.   |
  - Theodore Geisel |




Re: Big problems with 7.1 locking up :-(

 I've upgraded a test-webserver to 7.1 when it was released. After a
 few days I upgraded a production-webserver to 7.1 on Jan. 8'th and it
 has been running without any problems. The webserver is not heavily
 loaded (load at 2-3 on average). I have made a buildworld -j 8 and it
 runs fine.

 If the reported lockup is due to i/o a buildworld will not be able to
 reproduce it.

 It has performed a buildworld without problems and I'll be doing some
 buildworlds throughout the day.

 This is on a HP c-class-blade with 8 GB ram, 2 x quad-core and the
 built-in p200-controller with 64 MB ram.

Forgot to add that CPUTYPE=nocona in /etc/make.conf.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(

 I am also surprised that this isn't more widely reported, as
 the hardware is very common. The only oddity with my compile
 is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
 I will remove it anyway, just so I am actually building a completely
 vanilla amd64. That way I should have what everyone else has, and since
 I don't see anyone else saying they have issues then maybe mine will
 go away too (fingers crossed)

 Intel suggests nocona for x86_64 platforms and prescott for x86
 (i386) based platforms on the 4.2 line, because they best matched the
 cache size and featureset of the Core2 processors.

 I don't think that core2 support was fully completed in 4.2 (in
 fact I believe it was just started), and I don't think that our
 binutils supports it properly.

 Some thoughts,
 -Garrett

I've upgraded a test-webserver to 7.1 when it was released. After a
few days I upgraded a production-webserver to 7.1 on Jan. 8'th and it
has been running without any problems. The webserver is not heavily
loaded (load at 2-3 on average). I have made a buildworld -j 8 and it
runs fine.

If the reported lockup is due to i/o a buildworld will not be able to
reproduce it.

It has performed a buildworld without problems and I'll be doing some
buildworlds throughout the day.

This is on a HP c-class-blade with 8 GB ram, 2 x quad-core and the
built-in p200-controller with 64 MB ram.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(

 It has performed a buildworld without problems and I'll be doing some
 buildworlds throughout the day.

 This is on an HP c-class blade with 8 GB RAM, 2 x quad-core CPUs and the
 built-in P200 controller with 64 MB RAM.

I've performed five buildworlds decrementing -j from 16 to 6 and I
can't lock up the server.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(

 I've performed five buildworlds decrementing -j from 16 to 6 and I
 can't lock up the server.

Mine never lock up doing buildworlds either. They only lock up when they are
sitting there more or less idle! The machines which have never locked up
are the webservers, which are fairly heavily loaded. The machine which locks
up the most frequently is a box sitting there doing nothing but DNS, which is
the most lightly loaded of the lot.

I am going to roll back to 7.0 on all of the HP machines now, having
had yet another day of rebooting locked up machines. I will leave one
running 7.1 with the debug options in the kernel to try and get some
useful results out of this. All the machines are now running GENERIC with
no special optimisations, CPU types or anything like that. Absolutely out
of the box vanilla 7.1/amd64 as far as I know :-(

-pete.


Re: Big problems with 7.1 locking up :-(



On Fri, 9 Jan 2009, Garance A Drosihn wrote:


At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:

On Jan 8, 2009, at 8:58 PM, Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 perfectly 
happily. I have been testing 7.1 in its various incarnations for the last 
couple of months on our test server and it has performed perfectly.


I noticed a problem with 7.0 on a couple of Dell servers.  [...] We've 
since then compiled the kernel under the BSD scheduler to rule that out, 
and so far so good.


Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?


FWIW, the other guy I know who is having this problem had already switched 
to using ULE under 7.0-release, and did not have any problems with it.  So 
*his* problem was probably not related to SCHED_ULE, unless something has 
recently changed there.


Turns out he hasn't reverted back to 7.0-release just yet, so he's going to 
try SCHED_4BSD and see if that helps his situation.


Scheduler changes always come with some risk of exposing bugs that have 
existed in the code for a long time but never really manifested themselves. 
ULE is well shaken-out, having been under development for at least five years, 
but it is possible that some problems will become visible as a result of the 
switch.  I would encourage people to stick with ULE, but if you're having a 
stability problem then experimenting with the scheduler as a variable that could 
be triggering the problem may well be useful to help track down the bug.  Most 
of the time the bugs will not be in ULE itself; rather, they are triggered because 
ULE changes the ordering or balancing of work in the system.  So we should try 
to avoid situations where people switch to 4BSD from ULE and stick with it 
rather than getting the underlying problem fixed!


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(


On Sat, 10 Jan 2009, Pete French wrote:

FWIW, the other guy I know who is having this problem had already switched 
to using ULE under 7.0-release, and did not have any problems with it.  So 
*his* problem was probably not related to SCHED_ULE, unless something has 
recently changed there.


Well, one of my machines just locked up again, even with SCHED_4BSD on it, 
so I am now thinking it is unrelated.


The machine has completely locked - no response to pings, no response to 
keypresses, nor to the power button. There is nothing printed on the console 
- it is just sitting there with a login prompt :-(


This is really not good - these are extremely common servers after all, and 
I am just running bog standard 7.1 with apache and mysql. This is happening 
across several different servers, all of which are slight variants on the 
DL360, so I don't think it is something peculiar to me.


I'm not sure if you've done this already, but the normal suggestions apply: 
have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
any results / panics / etc result?  Sometimes these debugging tools are able 
to convert hangs into panics, which gives us much more ability to debug them. 
If it still hangs rather than panicking, are you able to break into the 
debugger on the console?  If you're using a video console and not able to get 
to the debugger, would it be possible to configure a serial console and use 
that -- serial breaks are often more successful at getting to the debugger 
than keyboard breaks.  Likewise, I'm not sure if this hardware has an NMI 
button -- some HP servers have one on the motherboard that you can press -- 
but that is also potentially a way to get into the debugger to analyze the 
crash.


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(

 I'm not sure if you've done this already, but the normal suggestions apply: 
 have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
 any results / panics / etc result?  Sometimes these debugging tools are able 
 to convert hangs into panics, which gives us much more ability to debug them. 

I did, but it turns out I had an incorrect option in there which made the
data I got not relevant. I now have another machine running a kernel
with the following config:

include GENERIC
ident   DEBUG

options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options MUTEX_DEBUG
options WITNESS
options LOCK_PROFILING
options INVARIANTS
options INVARIANT_SUPPORT
options DIAGNOSTIC

Those should enable me to get some useful output I hope.

 If it still hangs rather than panicking, are you able to break into the 
 debugger on the console?  If you're using a video console and not able to get 
 to the debugger, would it be possible to configure a serial console and use 

I can't add a serial console - I am remote enough from most of
these machines (Slough) and very remote from the test box (it's in the USA!)
so I can't get to them physically. But I do have iLo which lets me use the
console and gives me a bit of access to the front. I will check for NMI.

Just had another lockup here - my working day has become a succession of
running round rebooting servers through iLo at the moment.

Will get back to you when the debug one has crashed - I could possibly
give you direct access to the iLo console on that if you need it ?

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Garance A Drosihn


At 2:55 PM + 1/12/09, Robert Watson wrote:

On Fri, 9 Jan 2009, Garance A Drosihn wrote:


At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:

On Jan 8, 2009, at 8:58 PM, Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 
perfectly happily. I have been testing 7.1 in its various 
incarnations for the last couple of months on our test server and 
it has performed perfectly.


I noticed a problem with 7.0 on a couple of Dell servers.  [...] 
We've since then compiled the kernel under the BSD scheduler to 
rule that out, and so far so good.


Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?


FWIW, the other guy I know who is having this problem had already 
switched to using ULE under 7.0-release, and did not have any 
problems with it.  So *his* problem was probably not related to 
SCHED_ULE, unless something has recently changed there.


Turns out he hasn't reverted back to 7.0-release just yet, so he's 
going to try SCHED_4BSD and see if that helps his situation.


Scheduler changes always come with some risk of exposing bugs that 
have existed in the code for a long time but never really manifested 
themselves. ULE is well shaken-out, having been under development 
for at least five years, but it is possible that some problems will 
become visible as a result of the switch.  I would encourage people 
to stick with ULE, but if you're having a stability problem then 
experimenting with scheduler as a variable that could be triggering 
the problem may well be useful to help track down the bug.


Just to follow up on this:  My friend did switch back to a 7.1 kernel with
SCHED_4BSD, and he still ran into problems.  The error messages weren't
the same, but errors did happen in the same high disk-I/O situations as
the lockup happened with SCHED_ULE.  At this point he's fallen back to
the 7.0-kernel that he had been running (which also has SCHED_ULE), and
all the problems have gone away.  So at the moment he's running with a
7.0-ish kernel and the 7.1-release userland, without the hanging problems.
So the problem is something in the kernel, but it is *NOT* the scheduler
(at least, not in his case).

He is not eager to do a whole lot of experiments to track down the
problem, since this is happening on busy production machines and he
can't afford to have a lot of downtime on them (especially now that the
semester at RPI has started up).  The systems have some large (2 TB)
filesystems on them, and the lockups occur in high disk-I/O situations.
He's seeing the problem on one system which is a dual CPU quad-core
xeon, and another which is a 64 bit P4 with hyperthreading.  The one
thing in common between the two setups is that the boot drives + a
3ware controller (with its array of RAID disks) is moved from one
machine to the other one:

  it's a 3ware 9500 12 port model, the boot drive is connected to
   an ICH6 in IDE mode, and yes, I've run it in single, single with
   hyper threading, and 8 way mode.  All 64 bit.

We still have no idea where the problem really is.  For all we know,
someone spilled a Pepsi on it when he wasn't looking...

--
Garance Alistair Drosehn            =   g...@gilead.netel.rpi.edu
Senior Systems Programmer           or  g...@freebsd.org
Rensselaer Polytechnic Institute    or  dro...@rpi.edu

Re: Big problems with 7.1 locking up :-(

 I'm not sure if you've done this already, but the normal suggestions apply: 
 have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
 any results / panics / etc result?  Sometimes these debugging tools are able 
 to convert hangs into panics, which gives us much more ability to debug them. 

OK, I have now had a machine hang again, with the correct debug options in
the kernel. The screen looked like this when I went to restart it:

http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear
to be console messages after the lock order reversal - is that normal ?

The machine did stay up for a significant amount of time before doing this. I
notice that it is more or less identical to the one I posted when I
had WITNESS_KDB in the kernel too, so maybe those results aren't
entirely spurious after all ?

Given it hasn't dropped to a debugger, is there anything else I can try ?

-pete.

Re: Big problems with 7.1 locking up :-(

 Just to followup on this:  My friend did switch back to a 7.1 kernel with
 SCHED_4BSD, and he still ran into problems.  The error messages weren't

Actually, I don't know if I posted it, but that was the same for me too.
The scheduler makes no difference, nor do CPU compile settings.

-pete.

Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Tomas Randa


Hello,

I have similar problems. The last good kernel I have is from the stable 
branch, October the 8th. Then in the next upgrade, I saw big problems with 
performance.

I tried ULE, 4BSD etc., but nothing helps, only downgrading the system back.

Now I am trying 7.1-p1 and the problems are here again. MySQL spends a 
lot of time waiting, with status "waiting for opening table" or "waiting for 
close tables".


I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and an 
Areca SATA controller. Could the problem be in the da device, for example?


Thanks Tomas Randa


Re: Big problems with 7.1 locking up :-(

 I have similar problems. The last good kernel I have is from the stable branch,
 October the 8th. Then in the next upgrade, I saw big problems with performance.
 I tried ULE, 4BSD etc., but nothing helps, only downgrading the system back.

 Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot
 of time waiting, with status "waiting for opening table" or "waiting for close
 tables".

 I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and an
 Areca SATA controller. Could the problem be in the da device, for example?

It was mentioned previously in this thread that CPUTYPE could be an
issue. Did you change this if you customized your kernel?

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

Re: Big problems with 7.1 locking up :-(


On Mon, 12 Jan 2009, Tomas Randa wrote:

I have similar problems. The last good kernel I have is from the stable branch, 
October the 8th. Then in the next upgrade, I saw big problems with performance. I 
tried ULE, 4BSD etc., but nothing helps, only downgrading the system back.


Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot 
of time waiting, with status "waiting for opening table" or "waiting for close 
tables".


I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and an 
Areca SATA controller. Could the problem be in the da device, for example?


So far, this sounds like a different problem than the one others have been 
posting about, which involves full system freezes rather than specific 
processes wedging or responding poorly.  I'd suggest starting by using 
procstat -k on the process ID to look at where specific threads are waiting 
in the kernel.  Is it simply that MySQL is being unreasonably slow in certain 
situations, or does it actually entirely stop operating?


If you're able to narrow down the date on the 7.x branch where the problem 
you're experiencing begins, that would be most helpful.  I'd suggest leaving 
your userspace on the 8th of October, and sliding the kernel forward in a binary 
search until you've narrowed it down a bit.  Obviously, this takes a bit of 
patience, but narrowing it down could be quite informative.
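
The binary search described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name first_bad_date and the is_bad callback are hypothetical, the regression date in the demo is made up, and is_bad stands in for the manual work of building a kernel from sources as of a given date, booting it, and checking whether the problem appears. It also assumes there is a single good-to-bad transition between the two dates.

```python
from datetime import date, timedelta


def first_bad_date(good, bad, is_bad):
    """Return the earliest date in (good, bad] for which is_bad(date) is true.

    Assumes the kernel at `good` works, the kernel at `bad` fails, and that
    once the regression appears it stays (one good-to-bad transition).
    """
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid      # regression is at or before mid
        else:
            good = mid     # regression is after mid
    return bad


if __name__ == "__main__":
    # Hypothetical stand-in for "build and boot the kernel as of this date
    # and see whether the problem shows up"; the date below is invented.
    regression = date(2008, 11, 20)
    found = first_bad_date(date(2008, 10, 8), date(2009, 1, 12),
                           lambda d: d >= regression)
    print(found)
```

Each round halves the remaining window, so roughly three months of commits needs only about seven kernel builds rather than one per day.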


Robert N M Watson
Computer Laboratory
University of Cambridge





Re: Big problems with 7.1 locking up :-(



On Mon, 12 Jan 2009, Pete French wrote:

I'm not sure if you've done this already, but the normal suggestions apply: 
have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
any results / panics / etc result?  Sometimes these debugging tools are 
able to convert hangs into panics, which gives us much more ability to 
debug them.


OK, I have now had a machine hang again, with the correct debug options in 
the kernel. The screen looked like this when I went to restart it:


http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear to 
be console messages after the lock order reversal - is that normal ?


Lock order reversals are warnings of potential deadlock due to a lock cycle, 
but deadlocks may not actually result, either because it's a false positive 
(some locking construct that is deadlock free but involves lock cycles), or 
because a cycle didn't actually form.  The message is suggestive, but if you 
have significant system activity after the message, then it may be unrelated.
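
To make the idea concrete, here is a toy sketch of lock-order checking. It is an illustration only and not how WITNESS is implemented: the class LockOrderChecker and the lock names "A" and "B" are hypothetical, and it only tracks pairwise orders seen by one thread of execution at a time.

```python
class LockOrderChecker:
    """Toy lock-order recorder; an illustration, not the WITNESS code."""

    def __init__(self):
        self.seen = set()   # (first, second) acquisition pairs observed so far
        self.held = []      # locks currently held, in acquisition order

    def acquire(self, name):
        """Record an acquisition and return any reversal warnings."""
        warnings = []
        for earlier in self.held:
            # If the opposite order was recorded before, this is a reversal.
            if (name, earlier) in self.seen:
                warnings.append(f"lock order reversal: {earlier} -> {name}")
            self.seen.add((earlier, name))
        self.held.append(name)
        return warnings

    def release(self, name):
        self.held.remove(name)


if __name__ == "__main__":
    checker = LockOrderChecker()
    checker.acquire("A")
    checker.acquire("B")         # records the order A -> B
    checker.release("B")
    checker.release("A")
    checker.acquire("B")
    print(checker.acquire("A"))  # opposite order B -> A: a warning is reported
```

In the demo the two orders are taken one after the other, so nothing ever blocks; the checker still warns because the two orders could deadlock if they overlapped, which is the sense in which a reversal is a warning rather than proof of a hang.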


The machine did stay up for a significant amount of time before doing this. I 
notice that it is more or less identical to the one I posted when I had 
WITNESS_KDB in the kernel too, so maybe those results aren't entirely 
spurious after all ?


Given it hasn't dropped to a debugger, is there anything else I can try ?


Features like WITNESS and INVARIANTS may change the timing of the kernel 
making certain race conditions less likely; I'd run with them for a bit and 
see if you can reproduce the hang with them present, as they will make 
debugging the problem a lot easier, if it's possible.


Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Big problems with 7.1 locking up :-(