Re: Various route locking fixes merged to stable/7 (was: Re: Big problems with 7.1 locking up :-()
Hi,

On 26 Feb 2009, at 2:28, Charles Sprickman wrote:

> On Wed, 25 Feb 2009, Robert Watson wrote:
>
>> Just a minor heads up: I've merged both Kip Macy's lock order fixes to the kernel routing code, and the route locking and reference counting fixes from kern/130652 to stable/7. These fixes should correct a number of reported network-related hangs. We might want to release a subset of these as an errata patch to 7.1 if they shake out well in 7-stable.
>
> +1
>
> Charles

Unfortunately, these changes make my system panic during early boot, around the time when ppp/routing is started. Backing out these changes prevents the panic.

I've filed this with some textdumps as:

kern/132404: panic sleeping thread after 25th Feb src/sys/net commits
http://www.freebsd.org/cgi/query-pr.cgi?pr=132404

Regards,
Ruben

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: Big problems with 7.1 locking up :-(
> FYI, I'm currently awaiting testing results from Pete on the MFC of a number of routing table locking fixes, and once that's merged (hopefully tomorrow?) I'll start on the patches in the above PR. I've taken a crash-course in routing table locking in the last few days... :-)

Just to let you know that I have had zero crashes since I put the patch live on Sunday. Of course that's only three days, but it does look very much like it has fixed it. I am also running with the other routing table patch too.

At this point no news is good news, as it is just sitting there ticking away nicely to itself. I will roll it out to a few more machines over the next few days. But looking good so far; I would encourage other people to try the patches if they are having problems...

-pete.
Re: Big problems with 7.1 locking up :-(
On Wed, 25 Feb 2009, Pete French wrote:

>> FYI, I'm currently awaiting testing results from Pete on the MFC of a number of routing table locking fixes, and once that's merged (hopefully tomorrow?) I'll start on the patches in the above PR. I've taken a crash-course in routing table locking in the last few days... :-)
>
> Just to let you know that I have had zero crashes since I put the patch live on Sunday. Of course that's only three days, but it does look very much like it has fixed it. I am also running with the other routing table patch too.
>
> At this point no news is good news, as it is just sitting there ticking away nicely to itself. I will roll it out to a few more machines over the next few days. But looking good so far; I would encourage other people to try the patches if they are having problems...

Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I can look at the PR and get that in-progress. Since the code affected by the PR is no longer in 8.x, I'll merge directly to 7.x, and probably fairly quickly since you've had it in production for a while.

Robert N M Watson
Computer Laboratory
University of Cambridge
Various route locking fixes merged to stable/7 (was: Re: Big problems with 7.1 locking up :-()
Just a minor heads up: I've merged both Kip Macy's lock order fixes to the kernel routing code, and the route locking and reference counting fixes from kern/130652 to stable/7. These fixes should correct a number of reported network-related hangs. We might want to release a subset of these as an errata patch to 7.1 if they shake out well in 7-stable.

Thanks again, especially, to Pete for his evaluation of bugs and patches, Kip for his fixes in head, and to Dmitrij Tejblum for his submission of the fixes in the above-mentioned PR.

Robert N M Watson
Computer Laboratory
University of Cambridge

On Wed, 25 Feb 2009, Robert Watson wrote:

> On Wed, 25 Feb 2009, Pete French wrote:
>
>>> FYI, I'm currently awaiting testing results from Pete on the MFC of a number of routing table locking fixes, and once that's merged (hopefully tomorrow?) I'll start on the patches in the above PR. I've taken a crash-course in routing table locking in the last few days... :-)
>>
>> Just to let you know that I have had zero crashes since I put the patch live on Sunday. Of course that's only three days, but it does look very much like it has fixed it. I am also running with the other routing table patch too.
>>
>> At this point no news is good news, as it is just sitting there ticking away nicely to itself. I will roll it out to a few more machines over the next few days. But looking good so far; I would encourage other people to try the patches if they are having problems...
>
> Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I can look at the PR and get that in-progress. Since the code affected by the PR is no longer in 8.x, I'll merge directly to 7.x, and probably fairly quickly since you've had it in production for a while.
Re: Big problems with 7.1 locking up :-(
On Wed, Feb 25, 2009 at 11:04:29AM +, Robert Watson wrote:

> On Wed, 25 Feb 2009, Pete French wrote:
>
>>> FYI, I'm currently awaiting testing results from Pete on the MFC of a number of routing table locking fixes, and once that's merged (hopefully tomorrow?) I'll start on the patches in the above PR. I've taken a crash-course in routing table locking in the last few days... :-)
>>
>> Just to let you know that I have had zero crashes since I put the patch live on Sunday. Of course that's only three days, but it does look very much like it has fixed it. I am also running with the other routing table patch too.
>>
>> At this point no news is good news, as it is just sitting there ticking away nicely to itself. I will roll it out to a few more machines over the next few days. But looking good so far; I would encourage other people to try the patches if they are having problems...
>
> Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I can look at the PR and get that in-progress. Since the code affected by the PR is no longer in 8.x, I'll merge directly to 7.x, and probably fairly quickly since you've had it in production for a while.

Great! I hope this patch will also fix the mysterious hangs I've experienced on Soekris routers since Nov/Dec 2008. I will try it in a few days and report back any further hangs.

> Robert N M Watson
> Computer Laboratory
> University of Cambridge

Regards,
-cpghost.

--
Cordula's Web. http://www.cordula.ws/
Re: Various route locking fixes merged to stable/7 (was: Re: Big problems with 7.1 locking up :-()
On Wed, 25 Feb 2009, Robert Watson wrote:

> Just a minor heads up: I've merged both Kip Macy's lock order fixes to the kernel routing code, and the route locking and reference counting fixes from kern/130652 to stable/7. These fixes should correct a number of reported network-related hangs. We might want to release a subset of these as an errata patch to 7.1 if they shake out well in 7-stable.

+1

Charles

> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
> On Wed, 25 Feb 2009, Robert Watson wrote:
>
>> On Wed, 25 Feb 2009, Pete French wrote:
>>
>>>> FYI, I'm currently awaiting testing results from Pete on the MFC of a number of routing table locking fixes, and once that's merged (hopefully tomorrow?) I'll start on the patches in the above PR. I've taken a crash-course in routing table locking in the last few days... :-)
>>>
>>> Just to let you know that I have had zero crashes since I put the patch live on Sunday. Of course that's only three days, but it does look very much like it has fixed it. I am also running with the other routing table patch too.
>>>
>>> At this point no news is good news, as it is just sitting there ticking away nicely to itself. I will roll it out to a few more machines over the next few days. But looking good so far; I would encourage other people to try the patches if they are having problems...
>>
>> Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I can look at the PR and get that in-progress. Since the code affected by the PR is no longer in 8.x, I'll merge directly to 7.x, and probably fairly quickly since you've had it in production for a while.
Re: Big problems with 7.1 locking up :-(
On Mon, 23 Feb 2009, aneeth wrote:

>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=
>>
>> OK, will give this a try, unless anyone else wants any traces from this locked machine? Is there a known way to tickle this bug when I've rebooted, to make sure it's fixed?
>
> We've been having similar issues with a couple of our servers as well (7.0 and 7.1). However, the problem shows up only on quad core machines. The dual core machines are running fine.

FYI, I'm currently awaiting testing results from Pete on the MFC of a number of routing table locking fixes, and once that's merged (hopefully tomorrow?) I'll start on the patches in the above PR. I've taken a crash-course in routing table locking in the last few days... :-)

The patches I sent him are at:

http://www.watson.org/~robert/freebsd/20090221-route-locking.diff

They do not include the patch from the above PR, which I want to handle separately as it's a significantly different issue.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Big problems with 7.1 locking up :-(
Pete French-2 wrote:

>> Probably this is your case; please try:
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=
>
> OK, will give this a try, unless anyone else wants any traces from this locked machine? Is there a known way to tickle this bug when I've rebooted, to make sure it's fixed?
>
> thanks,
>
> -pete.

We've been having similar issues with a couple of our servers as well (7.0 and 7.1). However, the problem shows up only on quad core machines. The dual core machines are running fine.

--
View this message in context: http://www.nabble.com/Big-problems-with-7.1-locking-up-%3A-%28-tp21364913p22176398.html
Sent from the freebsd-stable mailing list archive at Nabble.com.
Re: Big problems with 7.1 locking up :-(
On Tue, 17 Feb 2009, Mike Tancsa wrote:

> Do you have any other details about these issues? Were the fixes ever MFC'd?

Earlier today I handed off some patches for Pete to test (attached below), which he's running alongside the patches in kern/130652. When I run with the patches, basically an MFC of a subset of Kip's routing improvements in 8.x, I can no longer reproduce the lock reversal, which will hopefully mean Pete can no longer reproduce the hang. I plan to merge these in a couple of days once (with any luck) he's confirmed that is the case. We may want to get a subset of this patch on the errata note path, if we can get the ICMP redirect fix down to a very short patch.

Robert N M Watson
Computer Laboratory
University of Cambridge

Merge r185747, r185774, r185807, r185849, r185964, r185965, r186051, r186052 from head to stable/7; note that only the locking fixes and invariants checking are added from r185747, but not the move to an rwlock which would modify the kernel binary interface, nor the move to a non-recursible lock, which is still seeing problem reports in head. This corrects, among other things, a deadlock that may occur when processing incoming ICMP redirects.

r185747:

  - convert radix node head lock from mutex to rwlock
  - make radix node head lock not recursive
  - fix LOR in rtexpunge
  - fix LOR in rtredirect

  Reviewed by: sam

r185774:

  - avoid recursively locking the radix node head lock
  - assert that it is held if RTF_RNH_LOCKED is not passed

r185807:

  Fix a bug introduced in r185747: rather than dereferencing an uninitialized *rt to something undefined, use the fibnum that came in as function argument.

  Found with: Coverity Prevent(tm)
  CID: 4168

r185849:

  fix a reported panic when adding a route and one hit here when deleting a route

  - pass RTF_RNH_LOCKED to rtalloc1_fib in 2 cases where the lock is held
  - make sure the rnh lock is held across rt_setgate and rt_getifa_fib

r185964:

  Pass RTF_RNH_LOCKED to rtalloc1 since the node head is locked, this avoids a recursive lock panic on inet6 detach.

  Reviewed by: kmacy

r185965:

  RTF_RNH_LOCKED needs to be passed in the flags arg not report, apologies to thompsa

r186051:

  in6_addroute is called through rnh_addadr which is always called with the radix node head lock held exclusively. Pass RTF_RNH_LOCKED to rtalloc so that rtalloc1_fib will not try to re-acquire the lock.

r186052:

  don't acquire lock recursively

All original commits to head were by Kip Macy kmacy, except r185964 by thompsa.

Reviewed by: bz
Tested by: Pete French petefrench at ticketswitch com

Property changes on: sys
___
Modified: svn:mergeinfo
   Merged /head/sys:r185747,185774,185807,185849,185964-185965,186051-186052

Index: sys/netinet/in_rmx.c
===================================================================
--- sys/netinet/in_rmx.c (revision 188767)
+++ sys/netinet/in_rmx.c (working copy)
@@ -111,7 +111,7 @@
   * ARP entry and delete it if so.
   */
  rt2 = in_rtalloc1((struct sockaddr *)sin, 0,
-     RTF_CLONING, rt->rt_fibnum);
+     RTF_CLONING|RTF_RNH_LOCKED, rt->rt_fibnum);
  if (rt2) {
   if (rt2->rt_flags & RTF_LLINFO &&
       rt2->rt_flags & RTF_HOST &&

Property changes on: sys/dev/cxgb
___
Modified: svn:mergeinfo
   Merged /head/sys/dev/cxgb:r185747,185774,185807,185849,185964-185965,186051-186052

Property changes on: sys/dev/ath/ath_hal
___
Modified: svn:mergeinfo
   Merged /head/sys/dev/ath/ath_hal:r185747,185774,185807,185849,185964-185965,186051-186052

Index: sys/net/route.c
===================================================================
--- sys/net/route.c (revision 188767)
+++ sys/net/route.c (working copy)
@@ -277,6 +277,7 @@
  struct rt_addrinfo info;
  u_long nflags;
  int err = 0, msgtype = RTM_MISS;
+ int needlock;
 
  KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
  if (dst->sa_family != AF_INET) /* Only INET supports > 1 fib now */
@@ -290,7 +291,13 @@
   rtstat.rts_unreach++;
   goto miss2;
  }
- RADIX_NODE_HEAD_LOCK(rnh);
+ needlock = !(ignflags & RTF_RNH_LOCKED);
+ if (needlock)
+  RADIX_NODE_HEAD_LOCK(rnh);
+#ifdef INVARIANTS
+ else
+  RADIX_NODE_HEAD_LOCK_ASSERT(rnh);
+#endif
  if ((rn = rnh->rnh_matchaddr(dst, rnh)) &&
      (rn->rn_flags & RNF_ROOT) == 0) {
   /*
@@ -343,7 +350,8 @@
   RT_LOCK(newrt);
   RT_ADDREF(newrt);
  }
- RADIX_NODE_HEAD_UNLOCK(rnh);
+ if
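The idea behind the rtalloc1_fib hunk above, taking the radix node head lock only when the caller has not already declared that it holds it, can be sketched outside the kernel. What follows is a single-threaded user-space model under stated assumptions, not the FreeBSD code: `struct simple_lock`, `table_lookup`, `table_add` and the flag `TABLE_LOCKED_BY_CALLER` (standing in for RTF_RNH_LOCKED) are hypothetical names, and the assertion in `lock_acquire` merely stands in for the panic that recursing on a non-recursive lock would cause.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag, standing in for RTF_RNH_LOCKED in the patch. */
#define TABLE_LOCKED_BY_CALLER 0x1

/* Single-threaded model of a non-recursive lock. */
struct simple_lock {
    bool held;
    int  acquisitions;          /* counted only for illustration */
};

struct table {
    struct simple_lock lock;
    int entries;
};

static void lock_acquire(struct simple_lock *l)
{
    assert(!l->held);           /* recursion here models the panic */
    l->held = true;
    l->acquisitions++;
}

static void lock_release(struct simple_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Look up the entry count; lock only if the caller does not hold it. */
int table_lookup(struct table *t, int flags)
{
    bool needlock = !(flags & TABLE_LOCKED_BY_CALLER);
    int n;

    if (needlock)
        lock_acquire(&t->lock);
    else
        assert(t->lock.held);   /* mirrors the INVARIANTS assertion */
    n = t->entries;
    if (needlock)
        lock_release(&t->lock);
    return n;
}

/* Add an entry; the nested lookup is told the lock is already held. */
int table_add(struct table *t)
{
    int n;

    lock_acquire(&t->lock);
    t->entries++;
    n = table_lookup(t, TABLE_LOCKED_BY_CALLER);
    lock_release(&t->lock);
    return n;
}
```

Calling `table_lookup(t, 0)` from inside `table_add` instead would trip the assertion in `lock_acquire`, which is the user-space analogue of the recursive-lock panics the merged commits describe.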
Re: Big problems with 7.1 locking up :-(
On Tue, 17 Feb 2009, Mike Tancsa wrote:

> At 05:38 PM 1/29/2009, Robert Watson wrote:
>
>> On Fri, 9 Jan 2009, Pete French wrote:
>>
>>> I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly. So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.
>>
>> For those following this other than Pete, who I've been in private correspondence with: it seems that he is running into two different deadlocks in the routing code. One of them (at least) is triggered by a lock order problem relating to the processing of ICMP redirects -- uncommon in most configurations, but quite a few on his network, which triggers quickly under load. Kip Macy has corrected at least one (both?) problems in head, and plans to MFC the fixes in the near future. We'll follow up further once the fixes are merged, and if any further problems transpire.
>
> Do you have any other details about these issues? Were the fixes ever MFC'd?

Hi Mike, et al,

I gave Kip a ping about MFCing the fixes and he said he would do that, but has apparently been preoccupied. I'm working on an MFC patch currently, but as I'm not all that familiar with the routing code, and the bug fixes were mixed with feature enhancements in his original commits, it will probably take me a bit longer to produce a candidate patch.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Big problems with 7.1 locking up :-(
At 05:38 PM 1/29/2009, Robert Watson wrote:

> On Fri, 9 Jan 2009, Pete French wrote:
>
>> I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly. So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.
>
> For those following this other than Pete, who I've been in private correspondence with: it seems that he is running into two different deadlocks in the routing code. One of them (at least) is triggered by a lock order problem relating to the processing of ICMP redirects -- uncommon in most configurations, but quite a few on his network, which triggers quickly under load. Kip Macy has corrected at least one (both?) problems in head, and plans to MFC the fixes in the near future. We'll follow up further once the fixes are merged, and if any further problems transpire.

Hi Robert,

Do you have any other details about these issues? Were the fixes ever MFC'd?

---Mike
Re: Big problems with 7.1 locking up :-(
Hi,

Just to let you know what's going on with the issue. I tried kern.hz=100 on GENERIC 7.1, but the soekris started rebooting with ethernet-only traffic. I made a custom kernel with RELENG_7 from 13 Feb and:

options CPU_SOEKRIS
options CPU_GEODE

The soekris is quite stable now and I'm unable to freeze it so far :)

On Feb 8, 2009, at 10:28 PM, Mike Tancsa wrote:

> At 10:11 AM 2/8/2009, Stefan Lambrev wrote:
>
>> Hi all,
>>
>> In this thread someone mentioned a problem with soekris devices. I personally have one of those new soekris devices and installed 7.1R and it is very easy to freeze it. All that I have to do is to copy a big file over WIFI (atheros) at a speed higher than 1-2MB/s.
>
> Try and copy across the ethernet. I have several RELENG_7 boxes deployed on soekris and Alix boards (same chipset pretty well) and have not seen any stability issues.
>
> ---Mike

--
Best Wishes,
Stefan Lambrev
ICQ# 24134177
Re: Big problems with 7.1 locking up :-(
Guy Helmer wrote:

> FWIW, I think I have tracked down the changes just prior to 7.1-RELEASE that are causing my Supermicro dual Xeon machines to wedge. I did the binary search between 2008-10-02 and 2008-11-24 without reproducing any lockups, and then I went on to search between 2008-11-24 and 2009-01-04. An SMP kernel build from 2008-12-22 (r186409) sources was stable for over two weeks; a kernel built from 2008-12-29 (r186590) sources wedged in under 24 hours under moderate load.
>
> It appears that the significant changes between r186409 and r186590 were r186552 (delphij - reverted ATA changes) and r186535/r186534 (delphij - reverted bce changes). My machines don't have bce interfaces, so I suspect the ATA changes.

Never mind. I'm stepping back through older kernels and finding that the hangs are now occurring in kernels that had seemed to be stable...

Guy
Re: Big problems with 7.1 locking up :-(
Guy Helmer wrote:

> Pete French wrote:
>
>> I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly. So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.
>>
>> Since then I have started seeing machines lock up. This always happens under heavy disc load. When I bring the machine back up then sometimes it fails to fsck due to a partially truncated inode. The lockups appear to be disc related - on my mysql master machine it will come back up with files somewhat shorter than those which have already been transmitted to the slave (i.e. some data was in memory, and claimed to have been written to the drive, but never made it onto the disc).
>>
>> The only time I have seen anything useful on the screen was during one lockup where I got a message about a spin lock being held too long and some comment in parentheses about it being a turnstile lock.
>>
>> Help! :-( I am now downgrading all the machines to 7.0 as fast as I can - though the machine I am trying to compile it on has locked up once during the compile so I haven't got anywhere so far.
>>
>> The machines are HP Proliant DL360 G5s - they have an embedded P400i RAID controller with a pair of mirrored drives connected. Each one has both ethernets connected, bundled using lagg and LACP.
>
> I can't tell whether my situation is related, but I am seeing lockups on SMP Supermicro servers with both older (NetBurst-ish) and current Xeon CPUs. I have been dropping into the kernel debugger and getting lock information and process backtraces, but so far nothing has been conclusively identified. I think the issue I'm seeing was introduced sometime between October 2 and November 24 in the RELENG_7 branch, and I suppose the next step is to do a binary search for the offending change.
>
> Guy

FWIW, I think I have tracked down the changes just prior to 7.1-RELEASE that are causing my Supermicro dual Xeon machines to wedge. I did the binary search between 2008-10-02 and 2008-11-24 without reproducing any lockups, and then I went on to search between 2008-11-24 and 2009-01-04. An SMP kernel build from 2008-12-22 (r186409) sources was stable for over two weeks; a kernel built from 2008-12-29 (r186590) sources wedged in under 24 hours under moderate load.

It appears that the significant changes between r186409 and r186590 were r186552 (delphij - reverted ATA changes) and r186535/r186534 (delphij - reverted bce changes). My machines don't have bce interfaces, so I suspect the ATA changes.

Any thoughts?

Thanks,
Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.
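The binary search over revisions described above can be sketched as follows. This is only an illustration under assumptions: `first_bad` and `example_is_bad` are hypothetical names, the predicate is a stand-in for "build and run a kernel from that revision and see whether it wedges", and the threshold inside it is arbitrary rather than a claim about which revision is at fault. The search also assumes that every revision after the first bad one is bad and that a bad kernel reliably shows the problem within the test window; an intermittent hang breaks that assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Predicate type: returns nonzero if a kernel built at rev misbehaves. */
typedef int (*is_bad_fn)(long rev);

/*
 * revs is sorted ascending, revs[0] is known good and revs[n - 1] is
 * known bad (so n >= 2). Returns the index of the first bad revision.
 */
size_t first_bad(const long *revs, size_t n, is_bad_fn is_bad)
{
    size_t good = 0;            /* highest index known to be good */
    size_t bad = n - 1;         /* lowest index known to be bad */

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(revs[mid]))
            bad = mid;
        else
            good = mid;
    }
    return bad;
}

/* Hypothetical predicate for demonstration only; the cutoff is arbitrary. */
int example_is_bad(long rev)
{
    return rev >= 186535;
}
```

With the five revision numbers mentioned in the message as input, the loop needs two test builds to narrow the range to a single candidate; in practice each "test" is a kernel build plus however long it takes to trust that a hang would have appeared.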
Re: Big problems with 7.1 locking up :-(
> load. Kip Macy has corrected at least one (both?) problems in head, and plans to MFC the fixes in the near future. We'll follow up further once the fixes are merged, and if any further problems transpire.

Hi, just wondering if we are any closer to having the MFC for this yet, or if there are any patches I could test?

cheers,

-pete.
Re: Big problems with 7.1 locking up :-(
Hi all,

In this thread someone mentioned a problem with soekris devices. I personally have one of those new soekris devices and installed 7.1R and it is very easy to freeze it. All that I have to do is to copy a big file over WIFI (atheros) at a speed higher than 1-2MB/s. It takes less than 2 minutes to freeze.

I wonder if there is some improvement in 7.1-stable so I can try it, or if I can help by compiling a debug kernel? But I'm not sure if this is the same problem, as it may be just the wireless driver in my case.

On Feb 8, 2009, at 3:11 PM, Pete French wrote:

>> load. Kip Macy has corrected at least one (both?) problems in head, and plans to MFC the fixes in the near future. We'll follow up further once the fixes are merged, and if any further problems transpire.
>
> Hi, just wondering if we are any closer to having the MFC for this yet, or if there are any patches I could test?
>
> cheers,
>
> -pete.

--
Best Wishes,
Stefan Lambrev
ICQ# 24134177
Re: Big problems with 7.1 locking up :-(
On Sun, Feb 08, 2009 at 05:11:02PM +0200, Stefan Lambrev wrote:

> Hi all,
>
> In this thread someone mentioned a problem with soekris devices. I personally have one of those new soekris devices and installed 7.1R and it is very easy to freeze it. All that I have to do is to copy a big file over WIFI (atheros) at a speed higher than 1-2MB/s. It takes less than 2 minutes to freeze.
>
> I wonder if there is some improvement in 7.1-stable so I can try it, or if I can help by compiling a debug kernel? But I'm not sure if this is the same problem, as it may be just the wireless driver in my case.

On some net4801s without WIFI, I also experience frequent freezes after anything from a couple of hours up to 2-5 days... so it's probably not only ath related.

What's your kern.hz value? In my /boot/loader.conf, it is set to 100. Could you try it too, and see if you can still freeze the box (just to rule out some weird timing / interrupt issue)?

> Best Wishes,
> Stefan Lambrev
> ICQ# 24134177

Regards,
-cpghost.

--
Cordula's Web. http://www.cordula.ws/
Re: Big problems with 7.1 locking up :-(
At 10:11 AM 2/8/2009, Stefan Lambrev wrote:

> Hi all,
>
> In this thread someone mentioned a problem with soekris devices. I personally have one of those new soekris devices and installed 7.1R and it is very easy to freeze it. All that I have to do is to copy a big file over WIFI (atheros) at a speed higher than 1-2MB/s.

Try and copy across the ethernet. I have several RELENG_7 boxes deployed on soekris and Alix boards (same chipset pretty well) and have not seen any stability issues.

---Mike
Re: Big problems with 7.1 locking up :-(
On Fri, 9 Jan 2009, Pete French wrote:

> I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly. So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.

For those following this other than Pete, who I've been in private correspondence with: it seems that he is running into two different deadlocks in the routing code. One of them (at least) is triggered by a lock order problem relating to the processing of ICMP redirects -- uncommon in most configurations, but quite a few on his network, which triggers quickly under load. Kip Macy has corrected at least one (both?) problems in head, and plans to MFC the fixes in the near future. We'll follow up further once the fixes are merged, and if any further problems transpire.

Robert N M Watson
Computer Laboratory
University of Cambridge

> Since then I have started seeing machines lock up. This always happens under heavy disc load. When I bring the machine back up then sometimes it fails to fsck due to a partially truncated inode. The lockups appear to be disc related - on my mysql master machine it will come back up with files somewhat shorter than those which have already been transmitted to the slave (i.e. some data was in memory, and claimed to have been written to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup where I got a message about a spin lock being held too long and some comment in parentheses about it being a turnstile lock.
>
> Help! :-( I am now downgrading all the machines to 7.0 as fast as I can - though the machine I am trying to compile it on has locked up once during the compile so I haven't got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i RAID controller with a pair of mirrored drives connected. Each one has both ethernets connected, bundled using lagg and LACP.
>
> Advice?
>
> -pete.
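To illustrate what a "lock order problem" means in the explanation above: if one code path takes lock A then lock B while another takes B then A, two threads can each hold one lock and wait forever for the other. A common discipline is to give every lock a rank and only ever acquire locks in increasing rank. The sketch below is a toy, single-threaded checker for that discipline; `struct order_checker`, `oc_acquire` and `oc_release` are hypothetical names, the ranks are arbitrary, and this is not how the kernel's own lock order checking is implemented.

```c
#include <assert.h>

#define MAX_HELD 8

/* Tracks the ranks of locks currently held, in acquisition order. */
struct order_checker {
    int held[MAX_HELD];
    int depth;
    int reversals;              /* number of out-of-order acquisitions */
};

/*
 * Record acquisition of a lock with the given rank. Returns 1 if the
 * rank is higher than every lock already held, 0 if this acquisition
 * is a lock order reversal.
 */
int oc_acquire(struct order_checker *oc, int rank)
{
    int in_order = 1;
    int i;

    assert(oc->depth < MAX_HELD);
    for (i = 0; i < oc->depth; i++) {
        if (oc->held[i] >= rank)
            in_order = 0;
    }
    if (!in_order)
        oc->reversals++;
    oc->held[oc->depth++] = rank;
    return in_order;
}

/* Release the most recently acquired lock. */
void oc_release(struct order_checker *oc)
{
    assert(oc->depth > 0);
    oc->depth--;
}
```

A path that acquires rank 1 then rank 2 passes; a path that acquires rank 2 then rank 1 is flagged, even though on its own it would not hang. The hang only appears when both paths run concurrently, which fits the report that the problem shows up quickly under load.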
Re: Big problems with 7.1 locking up :-(
Pete,

Have you considered enabling serial console emulation in the BIOS on the machines? I have got my iLO cards set up to redirect the serial ports on my HP servers so that I can ssh into the iLO cards and, by typing Esc-Q, access what I would otherwise see on a serial console.

Unfortunately I don't have a DL360 to try and reproduce your problem on.

Regards,

Cian.

On 12 Jan 2009, at 15:16, Pete French wrote:

> I can't add a serial console - I am remote enough from most of these machines (Slough) and very remote from the test box (it's in the USA!) so I can't get to them physically. But I do have iLO which lets me use the console and gives me a bit of access to the front. I will check for NMI.
Re: Big problems with 7.1 locking up :-(
Pete Carah wrote:

> Well, following up on my own reply earlier, I csup'd releng_7 with a date of last Dec 1; the result works fine in the laptop. I'll reload the eastern soekris tonight and see how it does. If the soekris is fine also then this gives a data point for whenever the bad commit(s) happened.
>
> I had apparently made the mistaken assumption that a general release should be better debugged than the work-in-progress leading up to it...

I'm sorry FreeBSD has failed to live up to your expectations. As ever, we strive to fix more bugs than we introduce, but in a changing codebase this is never possible to guarantee or to always achieve.

Kris
Re: Big problems with 7.1 locking up :-(
Pete Carah wrote: I have done some (lots of) kernel debugging in the past. I have several points: 1. I shouldn't *have* to kernel debug for a normal usage of an official release. Um, why not? We certainly put every possible effort into making sure that releases (and in fact stable branches in general) do not regress, but it is inevitable that with the extraordinarily wide range of hardware and workloads that run FreeBSD there will be regressions. Mark has already commented on the need for wider community review of release candidates, so I won't go there. Helping the community understand and hopefully fix the problem when something goes pear-shaped for you is part of the price of free software. If you're unwilling (or unable, as you indicate below) to pay that price, you may just have to find another tool for the job. 2. One of the soekris boxes is 2800 MILES away, in a remote location, That's what serial consoles and remote power switches are for. It's not our fault if you don't have access to those. 3. I can't afford the time to debug my tools (freebsd is a tool, not an experiment, for lots of people, including me...) I use this laptop at work in a place where I am *not* working on freebsd. (nor am I even allowed to at work...) Doug (who's still a volunteer, last time I checked) -- This .signature sanitized for your protection
Re: Big problems with 7.1 locking up :-(
yes, do ps - threads in state L or LL and RUN are especially interesting, trace of pids 28, 27, and threads which are in L state on a locked chan. Here's the output of alllocks: http://toybox.twisted.org.uk/~pete/71_show_alllocks.png Here are the pages of ps: http://toybox.twisted.org.uk/~pete/71_lock_ps2/ (next time I boot this I will disable http to avoid getting so many) I can't see any which are in L, LL or RUN state there though. A few RL and WL towards the end. Traces on 28 and 27 are here: http://toybox.twisted.org.uk/~pete/71_trace_28.png http://toybox.twisted.org.uk/~pete/71_trace_27a.png http://toybox.twisted.org.uk/~pete/71_trace_27b.png I also did traces on 19 and 16 as (like 28 and 27) they are in a CPU state, so may be of interest? http://toybox.twisted.org.uk/~pete/71_trace_19.png http://toybox.twisted.org.uk/~pete/71_trace_16.png -pete.
Re: Big problems with 7.1 locking up :-(
On Mon, Jan 19, 2009 at 11:39:08AM +, Pete French wrote: yes, do ps - threads in state L or LL and RUN are especially interesting, trace of pids 28, 27, and threads which are in L state on a locked chan. Here's the output of alllocks: http://toybox.twisted.org.uk/~pete/71_show_alllocks.png Here are the pages of ps: http://toybox.twisted.org.uk/~pete/71_lock_ps2/ (next time I boot this I will disable http to avoid getting so many) I can't see any which are in L, LL or RUN state there though. A few RL and WL towards the end. Traces on 28 and 27 are here: http://toybox.twisted.org.uk/~pete/71_trace_28.png http://toybox.twisted.org.uk/~pete/71_trace_27a.png http://toybox.twisted.org.uk/~pete/71_trace_27b.png I also did traces on 19 and 16 as (like 28 and 27) they are in a CPU state, so may be of interest? http://toybox.twisted.org.uk/~pete/71_trace_19.png http://toybox.twisted.org.uk/~pete/71_trace_16.png This is probably your case; please try the patch from: http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat= -- Have fun! chd
Re: Big problems with 7.1 locking up :-(
This is probably your case; please try the patch from: http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat= OK, will give this a try, unless anyone else wants any traces from this locked machine? Is there a known way to tickle this bug when I've rebooted, to make sure it's fixed? thanks, -pete.
Re: Big problems with 7.1 locking up :-(
Kris writes: You and anyone else seeing performance problems should try to work through the advice given here: http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES from configs that worked fine in the past, their systems are very slow and/or locking up (mine are both) with the stable branch sometime (I noticed it sometime in December, but it got worse with the release.) Most were OK in October; mine (I think) were OK in late November - may narrow things down? Two of my systems that lock up have no internal visibility when they do (Soekris 4801's routing; the only time-intensive things running are routing (done in irq context) and pflog. These run with 60+ meg ram free.) These are complete lockups, though I did manage to get a ps out of my laptop last night by waiting 20 _minutes_ for it to start (!). This is not a generic performance problem. The laptop had 55 minutes of cpu time in the softdepflush thread after being up about an hour and 10 mins; this might give a hint. I didn't spot LL/RL state threads at the same time because I didn't know to. Now I do. BTW - the same ps showed 8 or so user-space procs in R state with NO cpu time; the kernel was hogging all of it for over an hour. Firefox did indeed trigger this one as someone else noted. A soekris doing only routing+nat has no such excuse... At least PHK was nice enough to note the watchdog in another thread :-) -- Pete
Re: Big problems with 7.1 locking up :-(
This is probably your case; please try the patch from: http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat= Well, I have been running this for a while now. I still get this on the console: http://toybox.twisted.org.uk/~pete/71_lor3.png but so far the machine has not crashed. Obviously it's only been an hour or so as yet, but given that it was freezing in about 5 minutes earlier this morning it does look good. So thanks for a good patch so far ;-) -pete.
Re: Big problems with 7.1 locking up :-(
http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat= Looks like I spoke too soon - it just locked up again, I am afraid. It is sitting there now at the debug prompt. It does, however, look very different this time. For example, here is 'show alllocks': http://toybox.twisted.org.uk/~pete/71_alllocks2.png That shows a lot of locks in UDP - is this the kind of thing you were worried about, Robert? When I do a 'ps' there are, this time, a number of processes in the 'LL' and 'L' state. The images of the 'ps' and the traces of those locked processes are to be found here: http://toybox.twisted.org.uk/~pete/71_lock_ps3/ I tried to keep the threads which belong to each process together. What else can I get out of this lockup? It looks like the most promising so far... -pete.
Re: Big problems with 7.1 locking up :-(
There are significant changes in UDP locking between 7.0 and 7.1, so it could be that we're looking at a regression there. If you're able to reproduce this reliably, it might well be worth doing a little search-and-replace in udp_usrreq.c along the following lines: INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT, INP_RLOCK -> INP_WLOCK, INP_RUNLOCK -> INP_WUNLOCK. Given that the latest lockup (see other email) has lots of locks in the UDP code, would you like me to try this next? The kernel which has just locked is one using Dmitry's patch from http://www.freebsd.org/cgi/query-pr.cgi?pr=130652 I am not sure why that would give me different traces during the lockup though. I was doing a lot more TCP traffic this time, but that shouldn't interfere with UDP, should it? -pete.
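The search-and-replace suggested above could be scripted roughly as follows. This is only a sketch: the three macro names come from the message above, while the function name `read_to_write_locks` and the sample input are made up for illustration. Whole-identifier matching is used so that INP_RLOCK is not rewritten inside INP_RLOCK_ASSERT, and other macros such as INP_INFO_RLOCK are left alone.

```python
import re

# Read-lock macros and their write-lock replacements, as listed in the thread.
SUBSTITUTIONS = {
    "INP_RLOCK_ASSERT": "INP_WLOCK_ASSERT",
    "INP_RLOCK": "INP_WLOCK",
    "INP_RUNLOCK": "INP_WUNLOCK",
}

# \b anchors make each name match only as a complete C identifier.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in SUBSTITUTIONS) + r")\b"
)


def read_to_write_locks(source: str) -> str:
    """Return the source text with inpcb read locks turned into write locks."""
    return _PATTERN.sub(lambda match: SUBSTITUTIONS[match.group(1)], source)


if __name__ == "__main__":
    # Hypothetical input, not taken from the real udp_usrreq.c.
    sample = "INP_RLOCK(inp);\nINP_RLOCK_ASSERT(inp);\nINP_RUNLOCK(inp);\n"
    print(read_to_write_locks(sample), end="")
```

To apply it for real, one would read sys/netinet/udp_usrreq.c, pass its contents through the function, write the result back, and review the diff before rebuilding the kernel.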
Re: Big problems with 7.1 locking up :-(
Pete Carah wrote: Kris writes: You and anyone else seeing performance problems should try to work through the advice given here: http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES from configs that worked fine in the past, their systems are very slow and/or locking up (mine are both) with the stable branch sometime (I noticed it sometime in December, but it got worse with the release.) Most were OK in October; mine (I think) were OK in late November - may narrow things down? Two of my systems that lock up have no internal visibility when they do (Soekris 4801's routing; the only time-intensive things running are routing (done in irq context) and pflog. These run with 60+ meg ram free.) These are complete lockups, though I did manage to get a ps out of my laptop last night by waiting 20 _minutes_ for it to start (!). This is not a generic performance problem. The laptop had 55 minutes of cpu time in the softdepflush thread after being up about an hour and 10 mins; this might give a hint. I didn't spot LL/RL state threads at the same time because I didn't know to. Now I do. BTW - the same ps showed 8 or so user-space procs in R state with NO cpu time; the kernel was hogging all of it for over an hour. Firefox did indeed trigger this one as someone else noted. A soekris doing only routing+nat has no such excuse... At least PHK was nice enough to note the watchdog in another thread :-) Actually, there have been several apparently different problems reported in this thread, some of which (including the message I replied to) *are* generic "my system is slower" problems. For generic "my system hangs" problems, see the chapter on kernel debugging in the handbook or follow the (same) advice given by Robert earlier in the thread.
Kris
Re: Big problems with 7.1 locking up :-(
I have done some (lots of) kernel debugging in the past. I have several points: 1. I shouldn't *have* to kernel debug for a normal usage of an official release. 2. One of the soekris boxes is 2800 MILES away, in a remote location, with no one present who is a skilled (or, indeed, any kind of) programmer. I usually thought I could trust a release, especially when I had been using the stable branch updated at about monthly intervals on 3 servers with no problems. (actually, I waited a while on 7.0 because .0 releases are traditionally quirky; in this case 7.0-rel worked fine and 7.1 has problems.) (and my servers are still running the *same* compilation of kernel/world with no problems; the hangs are unique to the laptop (which only started doing this badly with a Jan 9 csup) and the Soekris boxes (which started hangs sometime in December; they clearly don't run X...)) [ I've backed my house source to -stable of 12/1/08 and hope this will help; I don't have the time to fool around too much, and particularly to kernel debug something that shouldn't need it.] I can't even start X at all on this laptop now. At least I can boot it, but it isn't much use for work unless it can run X. 3. I can't afford the time to debug my tools (freebsd is a tool, not an experiment, for lots of people, including me...) I use this laptop at work in a place where I am *not* working on freebsd. (nor am I even allowed to at work...) -- Pete
Re: Big problems with 7.1 locking up :-(
On Mon, Jan 19, 2009 at 04:59:59PM -0500, Pete Carah wrote: I shouldn't *have* to kernel debug for a normal usage of an official release. Agreed, but the problems that people are having do not seem to have arisen on any of the systems that ran prerelease tests for 7.1. Although I'm sure it does not seem that way to you, 7.1R had a very long QA cycle, and as far as I knew all the showstopper issues had already been addressed (although I don't officially speak for re@, I'm just an observer.) With my bugmeister hat on, I'll happily accept suggestions about how we can get more people involved in testing the prerelease images. Clearly the situation we're in right now is not where anyone wants us to be. mcl
Re: Big problems with 7.1 locking up :-(
Well, following up on my own reply earlier, I csup'd releng_7 with a date of last Dec 1; the result works fine in the laptop. I'll reload the eastern soekris tonight and see how it does. If the soekris is fine also then this gives a data point for whenever the bad commit(s) happened. I had apparently made the mistaken assumption that a general release should be better debugged than the work-in-progress leading up to it... As I noted before - I'm in the business of *using* computers, not doing fbsd kernel work (I actually do linux kernel (device driver) work in $dayjob, but so far prefer fbsd for general use, like routers and servers.) I need to regen the soekris config here with the 12.01 source also; if it doesn't hang either then I can hope that someone can look through commit notes (I certainly don't have the time or internal knowledge of 7.x to do so) and try to figure out what may have happened. My daughter is tired of rebooting the soekris that is 2800 miles west of here. One extra data point: the systems that work OK with the release code have Intel chipsets (older - ich3 and ich5). The laptop is an AMD64 in 32-bit mode with an ATI chipset and broadcom wireless (hence uses project evil, which has its own problems with hangs). Soekris is Geode SC1100 with its own builtin chipset, presumably a mish-mash of things from Cyrix, National, and AMD, given the Geode series's history. It is not possible to gen a system or kernel on the soekris; gcc 4.x won't run in 128mb of ram with no swap (they run from cf cards; swap is not possible.) I have to cross-compile them and reload the cf cards externally if possible; if not I use nfs (which breaks the system badly if it hangs during a make install; this happened this past weekend, fortunately not on the other coast :-( -- Pete
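For readers who want to repeat the "csup releng_7 as of Dec 1" experiment, a supfile along these lines should check out the branch at a fixed date. It is a sketch rather than the file actually used above: the mirror host is a placeholder, the year 2008 is taken from the "-stable of 12/1/08" remark earlier in the thread, and the remaining lines follow the stock stable-supfile layout.

```
# Placeholder mirror; substitute one near you.
*default host=cvsup.FreeBSD.org
*default base=/var/db
*default prefix=/usr
# RELENG_7 as it stood at midnight on 1 December 2008.
*default release=cvs tag=RELENG_7 date=2008.12.01.00.00.00
*default delete use-rel-suffix
*default compress
src-all
```

Running csup against this file and rebuilding world and kernel gives a tree from before the suspected regression. Moving the date forward in steps is one way to narrow down when the bad commit(s) went in.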
Re: Big problems with 7.1 locking up :-(
Tomas Randa wrote: Hello, I have similar problems. The last good kernel I have is from the stable branch of October 8. Then, in the next upgrade, I saw big problems with performance. I tried ULE, 4BSD etc, but nothing helps, only downgrading the system back. Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot of time waiting, with status "waiting for opening table" or "waiting for close tables". I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and an Areca SATA controller. Could the problem be in the da device, for example? You and anyone else seeing performance problems should try to work through the advice given here: http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf Kris
Re: Big problems with 7.1 locking up :-(
Tomas Randa wrote: Hello, I have similar problems. The last good kernel I have is from the stable branch of October 8. Then, in the next upgrade, I saw big problems with performance. I can add a "me too" here. This is on my desktop, very lightly loaded. This computer never had a single problem under FreeBSD, so I don't suspect a hardware problem. My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked perfectly fine with exactly the same software configuration. Now I have FreeBSD 7.1-STABLE #0: Mon Jan 5, and the situation is disastrous. Freshly after boot the machine seems to work normally, but after a few days it becomes slower and slower, windows take seconds to appear, firefox3 begins to have garbled output, etc. Then I had the following problem: firefox got stuck in the kernel, impossible to kill even with kill -9. Needless to say I inspected everything, dmesg, xsession-errors, top, etc. without seeing anything suspicious. So I rebooted, and bingo! the machine panicked, mentioning firefox. But the panic itself got stuck and I had to push the reset button, so no dump. After reboot, the machine works OK for two or three days, then the problems begin again. I am convinced there is a big problem in the kernel.
For reference, here is top and dmesg:

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 264M Active, 613M Inact, 485M Wired, 22M Cache, 112M Buf, 116M Free
Swap: 2023M Total, 4K Used, 2023M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
62965 michel      1  44    0  3532K  1884K CPU1   1   0:00  0.29% top
 2327 root        1  44    0   161M 29228K select 1  30:39  0.00% Xorg
95937 root        1  44    0 24112K 16800K select 1   2:35  0.00% kdm-bin_gr
 3099 root        1   4    0  3304K  1028K select 0   1:30  0.00% moused
 2209 news        1   8    0  3464K  1052K wait   0   0:37  0.00% sh
  884 root        1  44    0  4712K  2028K select 1   0:12  0.00% ntpd
  453 _pflogd     1 -58    0  3380K  1352K bpf    0   0:11  0.00% pflogd
 1634 www         1   4    0  6268K  2656K kqread 0   0:10  0.00% lighttpd
  788 root        1  44    0  3164K  3184K select 0   0:04  0.00% amd
 2206 news        1  44    0 15208K 12160K select 0   0:03  0.00% innd
  879 root        9   4    0  5432K  2460K kqread 1   0:02  0.00% nscd
  955 root        1  44    0  2736K  1216K select 1   0:02  0.00% master
  758 root        1  44    0  3164K  1340K select 1   0:02  0.00% ypbind

... so no memory problem

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-STABLE #0: Mon Jan 5 14:29:23 CET 2009 mic...@niobe.lpthe.jussieu.fr:/usr/obj/usr/src/sys/NIOBE Timecounter i8254 frequency 1193182 Hz quality 0 CPU: Intel(R) Pentium(R) 4 CPU 3.06GHz (3073.65-MHz 686-class CPU) Origin = GenuineIntel Id = 0xf27 Stepping = 7 Features=0xbfebfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE Features2=0x4400CNXT-ID,xTPR Logical CPUs per core: 2 real memory = 1610530816 (1535 MB) avail memory = 1568387072 (1495 MB) ACPI APIC Table: ASUS P4PE FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 This module (opensolaris) contains code covered by the Common Development and Distribution License (CDDL) see http://opensolaris.org/os/licensing/opensolaris_license/ ioapic0 Version 2.0 irqs 0-23 on motherboard acpi0: ASUS P4PE on motherboard acpi0: Overriding SCI Interrupt from IRQ 9 to IRQ 22 acpi0: [ITHREAD] acpi0: Power Button (fixed) acpi0: reservation of 0, a (3) failed acpi0: reservation of 10, 5ff0 (3) failed Timecounter ACPI-fast frequency 3579545 Hz quality 1000 acpi_timer0: 24-bit timer at 3.579545MHz port 0xe408-0xe40b on acpi0 acpi_button0: Power Button on acpi0 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 agp0: Intel 82845G host to AGP bridge on hostb0 pcib1: ACPI PCI-PCI bridge at device 1.0 on pci0 pci1: ACPI PCI bus on pcib1 vgapci0: VGA-compatible display port 0xd800-0xd8ff mem 0xe000-0xefff,0xdf00-0xdf00 irq 16 at device 0.0 on pci1 uhci0: Intel 82801DB (ICH4) USB controller USB-A port 0xb800-0xb81f irq 16 at device 29.0 on pci0 uhci0: [GIANT-LOCKED] uhci0: [ITHREAD] usb0: Intel 82801DB (ICH4) USB controller USB-A on uhci0 usb0: USB revision 1.0 uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 on usb0 uhub0: 2 ports with 2 removable, self powered uhci1: Intel 82801DB (ICH4) USB controller USB-B port 0xb400-0xb41f irq 19 at device 29.1 on pci0 uhci1: 
[GIANT-LOCKED] uhci1: [ITHREAD] usb1: Intel 82801DB (ICH4) USB controller USB-B on uhci1 usb1: USB revision 1.0 uhub1: Intel
Re: Big problems with 7.1 locking up :-(
On Sun, 18 Jan 2009 13:21:17 +0100 Michel Talon ta...@lpthe.jussieu.fr wrote: My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked perfectly fine with exactly the same software configuration. Now I have FreeBSD 7.1-STABLE #0: Mon Jan 5, and the situation is disastrous. Makes you wonder what on earth could have changed that much between 7.0 and 7.1. Nice upgrade.. This should not happen on the same hardware! -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | SunOS sxce snv105 ++ + All that's really worth doing is what we do for others (Lewis Carroll)
Re: Big problems with 7.1 locking up :-(
My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked perfectly fine with exactly the same software configuration. Now I have FreeBSD 7.1-STABLE #0: Mon Jan 5, and the situation is disastrous. Makes you wonder what on earth could have changed that much between 7.0 and 7.1. Nice upgrade.. This should not happen on the same hardware! There will always be changes when new features/options/enhancements are introduced. I, for my part, have never had any serious trouble with FreeBSD whatsoever since FreeBSD 5.1/2, when some kernel limits had to be changed. My problem was solved with the help of this list. -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare
Re: Big problems with 7.1 locking up :-(
If you are able to get into the debugger, the normal commands would be most helpful, especially if you can log the results: It finally locked up, and ctrl-alt-esc got me into the debugger at last! Is there anything else you want me to get whilst it is like that, aside from: ps, show lockedvnods, show alllocks, which I can go and capture as screenshots? I can probably sort out console access to it if that would be useful whilst it is in this state? -pete.
Re: Big problems with 7.1 locking up :-(
ps: output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/ There are a lot of processes as this machine runs the same webservices as the actual webservers, just that nobody connects to them. show lockedvnods: nothing - there are no locked vnodes. show alllocks: this gives me 'no such command'. There's a whole list of things I can show, but none of them look like all the locks. What about the locktree or the lockchain? -pete.
Re: Big problems with 7.1 locking up :-(
On Fri, Jan 16, 2009 at 12:35:49PM +, Pete French wrote: ps: output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/ There are a lot of processes as this machine runs the same webservices as the actual webservers, just that nobody connects to them. show lockedvnods: nothing - there are no locked vnodes. show alllocks: this gives me 'no such command'. There's a whole list of things I can show, but none of them look like all the locks. What about the locktree or the lockchain? hi, please type: show lock 0xff0001254d20 and then show thread 0xXXX, where XXX is the 'owner' from the previous output. -- Have fun! chd
Re: Big problems with 7.1 locking up :-(
hi, please type: show lock 0xff0001254d20 and then show thread 0xXXX, where XXX is the 'owner' from the previous output. http://toybox.twisted.org.uk/~pete/71_pdns_lock.png That's in PowerDNS - which is interesting because the one difference between the boxes that lock and those which don't is that the locking ones are serving DNS. -pete.
Re: Big problems with 7.1 locking up :-(
On Fri, Jan 16, 2009 at 01:34:14PM +, Pete French wrote: hi, please type: show lock 0xff0001254d20 and then show thread 0xXXX, where XXX is the 'owner' from the previous output. http://toybox.twisted.org.uk/~pete/71_pdns_lock.png That's in PowerDNS - which is interesting because the one difference between the boxes that lock and those which don't is that the locking ones are serving DNS. trace 832 -- Have fun! chd
Re: Big problems with 7.1 locking up :-(
On Fri, 16 Jan 2009, Pete French wrote: hi, please type: show lock 0xff0001254d20 and then show thread 0xXXX, where XXX is the 'owner' from the previous output. http://toybox.twisted.org.uk/~pete/71_pdns_lock.png That's in PowerDNS - which is interesting because the one difference between the boxes that lock and those which don't is that the locking ones are serving DNS. I rather feared as much. Let's run down the path of "perhaps there's a problem with the new UDP locking code" for a bit and see where it takes us. Is it possible to run those boxes with WITNESS -- I believe that "show alllocks" is failing because WITNESS isn't present. The other thing we can do is revert UDP to using purely write locks -- the risk there is that it might change the timing but not actually resolve the bug, so if we can analyze it a bit using WITNESS first that would be useful. Robert N M Watson Computer Laboratory University of Cambridge
Re: Big problems with 7.1 locking up :-(
trace 832 http://toybox.twisted.org.uk/~pete/71_trace_832_1.png http://toybox.twisted.org.uk/~pete/71_trace_832_2.png -pete.
Re: Big problems with 7.1 locking up :-(
I rather feared as much. Let's run down the path of "perhaps there's a problem with the new UDP locking code" for a bit and see where it takes us. Is it possible to run those boxes with WITNESS -- I believe that "show alllocks" is failing because WITNESS isn't present. Yes, I can do that. The only reason I wasn't running with WITNESS is that it didn't lock up when I added the BREAK_TO_DEBUGGER, so I was seeing if a simple GENERIC kernel would lock up when I added that. I will go back and add WITNESS when you tell me there's nothing more we can get out of this lockup (recompiling will involve restarting the machine, so I lose the 'broken to debugger' state). Should I add anything else? Skip spinlocks? Invariants? The other thing we can do is revert UDP to using purely write locks -- the risk there is that it might change the timing but not actually resolve the bug, so if we can analyze it a bit using WITNESS first that would be useful. Yes, I will run with WITNESS and anything else you might want. Is there anything else you, or anyone else, wants from this kernel? It may take another day to lock up when I've restarted it, unfortunately. -pete.
Re: Big problems with 7.1 locking up :-(
On Fri, 16 Jan 2009, Pete French wrote: I rather feared as much. Let's run down the path of "perhaps there's a problem with the new UDP locking code" for a bit and see where it takes us. Is it possible to run those boxes with WITNESS -- I believe that "show alllocks" is failing because WITNESS isn't present. Yes, I can do that. The only reason I wasn't running with WITNESS is that it didn't lock up when I added the BREAK_TO_DEBUGGER, so I was seeing if a simple GENERIC kernel would lock up when I added that. I will go back and add WITNESS when you tell me there's nothing more we can get out of this lockup (recompiling will involve restarting the machine, so I lose the 'broken to debugger' state). Should I add anything else? Skip spinlocks? Invariants? The other thing we can do is revert UDP to using purely write locks -- the risk there is that it might change the timing but not actually resolve the bug, so if we can analyze it a bit using WITNESS first that would be useful. Yes, I will run with WITNESS and anything else you might want. Is there anything else you, or anyone else, wants from this kernel? It may take another day to lock up when I've restarted it, unfortunately. If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good. WITNESS does a number of things, including tracking (and being judgemental about) lock order. One nice side effect of that tracking is that we keep track of a lot more lock state explicitly, so DDB's show alllocks, show locks, etc, commands can build on that. show lockedvnods works without WITNESS, though, so your results so far suggest this is likely not related to vnode locking. Robert N M Watson Computer Laboratory University of Cambridge
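The debugging options discussed in this exchange could be added to a copy of the GENERIC kernel configuration roughly as shown below. This is a sketch, not the configuration actually used on the affected machines: the ident DEBUG is an assumption based on the later mention of a "DEBUG kernel", and KDB, DDB and INVARIANT_SUPPORT are listed because the options named in the thread need them to be of any use.

```
ident           DEBUG                   # hypothetical name for the custom kernel

options         KDB                     # kernel debugger framework
options         DDB                     # interactive in-kernel debugger
options         BREAK_TO_DEBUGGER       # let a console break drop into the debugger
options         INVARIANTS              # extra run-time consistency checks
options         INVARIANT_SUPPORT       # support code required by INVARIANTS
options         WITNESS                 # lock order checking; provides "show alllocks"
options         WITNESS_SKIPSPIN        # do not run WITNESS on spin locks
```

WITNESS and INVARIANTS add noticeable overhead, so a kernel built this way is for reproducing and diagnosing the hang, not for long-term production use.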
Re: Big problems with 7.1 locking up :-(
If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good. WITNESS does a number of things, including tracking (and being judgemental about) lock order. One nice side effect of that tracking is that we keep track of a lot more lock state explicitly, so DDB's show alllocks, show locks, etc, commands can build on that. show lockedvnods works without WITNESS, though, so your results so far suggest this is likely not related to vnode locking. Right, I've gone back to my DEBUG kernel, which has a lot of options in it, including all the above. Luckily it locked up almost immediately, so now I have it sitting at the debugger prompt. The output from 'show alllocks' is here: http://toybox.twisted.org.uk/~pete/71_show_alllocks.png Which of these are worth tracing? -pete.
Re: Big problems with 7.1 locking up :-(
Just continuing to look at this with the help of Dmitry, and the output from 'bt' is here: http://toybox.twisted.org.uk/~pete/71_bt.png The top bit of that is from my 'show alllocks', the full version of which is here: http://toybox.twisted.org.uk/~pete/71_show_alllocks.png -pete.
Re: Big problems with 7.1 locking up :-(
Just an update on this - I tried the various kernels, but now the machine is not locking up at all. As I haven't actually changed anything, this does not make me as happy as you might expect. I don't know what to do now - I dare not upgrade the machines to an OS that I know locks up, but if I can't make it lock up then it is impossible to get any useful debugging info out of it. Maybe waiting for 7.2 is the best move... -pete.
Re: Big problems with 7.1 locking up :-(
On Thu, 15 Jan 2009, Pete French wrote: Just an update on this - I tried the various kernels, but now the machine is not locking up at all. As I haven't actually changed anything, this does not make me as happy as you might expect. I don't know what to do now - I dare not upgrade the machines to an OS that I know locks up, but if I can't make it lock up then it is impossible to get any useful debugging info out of it. Maybe waiting for 7.2 is the best move... Well, one slightly pessimistic (or realistic) view says that all software contains bugs, it's just a question of whether or not your workload and environment trigger those bugs in a noticeable way. Given the inconsistency of the symptoms, I wouldn't preclude something environmental: could it be that it was the bottom, or more likely, top box in a rack and that your air conditioning isn't quite as effective there when the outside temperature is above/below some threshold? Alternatively, could it be that the workload changed very slightly -- you're doing fewer DNS queries, or the network latency to the DNS server changed? Certainly, whoever gave the advice on checking BIOS revisions is right: you can spend a lot of time tracking down a bug only to realize that one box has a slightly different BIOS rev and therefore does/doesn't suffer from an obscure SMI bug. In any case, if it starts to recur reproducibly, send out mail and we can see if we can track it down some more. BTW, did you establish whether the version of iLO you have has a remote NMI? I seem to recall that some do, and being able to deliver an NMI is really quite valuable. Robert N M Watson Computer Laboratory University of Cambridge
Re: Big problems with 7.1 locking up :-(
Given the inconsistency of the symptoms, I wouldn't preclude something environmental: could it be that it was the bottom, or more likely, top box in a rack and that your air conditioning isn't quite as effective there when the outside temperature is above/below some threshold? It's a possibility - but the two machines which were exhibiting the fault are in Slough and Baton Rouge respectively, so under very different climatic conditions. However, something has changed to make it stop locking up! The USA one was doing it every couple of hours at the start of the week, and the UK one wouldn't last more than half an hour at one point. Alternatively, could it be that the workload changed very slightly -- you're doing fewer DNS queries, or the network latency to the DNS server changed? Also a possibility - that workload is entirely dependent on customer behaviour, which is an unpredictable beast! Certainly, whoever gave the advice on checking BIOS revisions is right: you can spend a lot of time tracking down a bug only to realize that one box has a slightly different BIOS rev and therefore does/doesn't suffer from an obscure SMI bug. Yes, that's next on my list - make sure they are all on the same version. In any case, if it starts to recur reproducibly, send out mail and we can see if we can track it down some more. BTW, did you establish whether the version of iLO you have has a remote NMI? I seem to recall that some do, and being able to deliver an NMI is really quite valuable. OK, thanks. My iLO2 appears to have the ability to generate an NMI on demand, so that could be used if/when the fault crops up again. Thanks, will let this lie for now and resurrect the thread when I can get some more useful data. -pete.
Re: Big problems with 7.1 locking up :-(
On Thu, 15 Jan 2009, Pete French wrote: In any case, if it starts to recur reproducibly, send out mail and we can see if we can track it down some more. BTW, did you establish whether the version of iLO you have has a remote NMI? I seem to recall that some do, and being able to deliver an NMI is really quite valuable. OK, thanks. My iLO2 appears to have the ability to generate an NMI on demand, so that could be used if/when the fault crops up again. Thanks, will let this lie for now and resurrect the thread when I can get some more useful data. Excellent WRT NMI. As long as you have DDB, KDB, and BREAK_TO_DEBUGGER compiled into the kernel, generating that should reliably get you into the debugger. If it's possible to keep running with INVARIANTS and WITNESS, or just INVARIANTS if WITNESS slows things down too much, that would be desirable. You might want to give the NMI a test run just to make sure it behaves as you think it should, though -- be aware that if DDB/KDB aren't compiled into the kernel, then an NMI will panic the box. Robert N M Watson Computer Laboratory University of Cambridge
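For reference, the debugging options named in this exchange could be collected into a custom kernel configuration roughly as follows. This is only a sketch under assumptions: the config name DEBUG, the write_debug_config helper, and the use of "include GENERIC" are illustrative and not taken from the thread. INVARIANT_SUPPORT is listed because INVARIANTS depends on it.

```shell
#!/bin/sh
# Sketch: write a hypothetical kernel config that layers the debugging
# options discussed in this thread on top of GENERIC.
# The helper name and the config name DEBUG are illustrative assumptions.

write_debug_config() {
    # $1: path of the config file to create
    cat > "$1" <<'EOF'
include         GENERIC
ident           DEBUG

options         KDB                 # kernel debugger framework
options         DDB                 # interactive in-kernel debugger
options         BREAK_TO_DEBUGGER   # a console break drops into the debugger
options         INVARIANTS          # extra run-time consistency checks
options         INVARIANT_SUPPORT   # required by INVARIANTS
options         WITNESS             # lock order verification
options         WITNESS_SKIPSPIN    # do not run WITNESS on spin mutexes
EOF
}
```

It might be used as write_debug_config /usr/src/sys/amd64/conf/DEBUG, followed by the usual make buildkernel KERNCONF=DEBUG in /usr/src.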
Re: Big problems with 7.1 locking up :-(
desirable. You might want to give the NMI a test run just to make sure it behaves as you think it should, though -- be aware that if DDB/KDB aren't compiled into the kernel, then an NMI will panic the box. Unfortunately it does this... http://toybox.twisted.org.uk/~pete/71_nmi1.png That is locked up too - hitting return does nothing. I was hoping it was just garbled output but had actually gone to the debugger. Apparently not. That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which does work, as I have tested it with Ctrl-Alt-Esc. -pete.
Re: Big problems with 7.1 locking up :-(
On Thu, 15 Jan 2009, Pete French wrote: desirable. You might want to give the NMI a test run just to make sure it behaves as you think it should, though -- be aware that if DDB/KDB aren't compiled into the kernel, then an NMI will panic the box. Unfortunately it does this... http://toybox.twisted.org.uk/~pete/71_nmi1.png That is locked up too - hitting return does nothing. I was hoping it was just garbled output but had actually gone to the debugger. Apparently not. That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which does work, as I have tested it with Ctrl-Alt-Esc. Er, that's rather upsetting. John, do you have any ideas about this? Robert N M Watson Computer Laboratory University of Cambridge
Re: Big problems with 7.1 locking up :-(
On Thursday 15 January 2009 12:49:11 pm Robert Watson wrote: On Thu, 15 Jan 2009, Pete French wrote: desirable. You might want to give the NMI a test run just to make sure it behaves as you think it should, though -- be aware that if DDB/KDB aren't compiled into the kernel, then an NMI will panic the box. Unfortunately it does this... http://toybox.twisted.org.uk/~pete/71_nmi1.png That is locked up too - hitting return does nothing. I was hoping it was just garbled output but had actually gone to the debugger. Apparently not. That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which does work, as I have tested it with Ctrl-Alt-Esc. Er, that's rather upsetting. John, do you have any ideas about this? I still have no context on the rest of the thread. The garbage is due to competing panics, I think. The problem is we don't single-thread the printf's in 'trap_fatal()'. We should probably have some sort of simple spin lock in the x86 code to allow only one CPU at a time to run through that routine. -- John Baldwin
Re: Big problems with 7.1 locking up :-(
If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing ctrl-alt-break on the console to see if you can drop into the debugger, or issue a serial break on a serial console. Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained all the other stuff (WITNESS etc...). The end result... ...it no longer crashes :-( I am not sure what to make of that! What could adding this to the kernel possibly do which would make my problems go away? Should I try just adding this option to my GENERIC kernel and seeing if that also gives me something stable? -pete.
Re: Big problems with 7.1 locking up :-(
On Wed, 14 Jan 2009, Pete French wrote: If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing ctrl-alt-break on the console to see if you can drop into the debugger, or issue a serial break on a serial console. Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained all the other stuff (WITNESS etc...). The end result... ...it no longer crashes :-( I am not sure what to make of that! What could adding this to the kernel possibly do which would make my problems go away? Should I try just adding this option to my GENERIC kernel and seeing if that also gives me something stable? Yeah, that is unexpected -- the BREAK_TO_DEBUGGER path should have almost no effect on control flow, unlike, say, WITNESS, which significantly distorts timing. Is there any chance you picked up any of the recent fixes that went into RELENG_7 without noticing, and that perhaps one of those did it? With regard to what to do: if you didn't pick up a fix without noticing, yeah, I think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at least, masked) the problem. Generally with this sort of testing one has to be pretty rigorous in testing assumptions, because it's easy for changes to sneak in. Particularly annoying are seemingly innocuous code changes that do things like slightly rearrange kernel memory. FWIW, I suspect the various reports we are seeing reflect more than one problem, and that they must be relatively edge-case individually, but reports of a few problems have led to more coming out of the woodwork. Obviously, the problems are not edge-case to the people experiencing them... Robert N M Watson Computer Laboratory University of Cambridge
Re: Big problems with 7.1 locking up :-(
effect on control flow, unlike, say, WITNESS, which significantly distorts timing. Is there any chance you picked up any of the recent fixes that went into RELENG_7 without noticing, and that perhaps one of those did it? I'm pretty certain I didn't - I have just been changing kernel config files, I haven't actually csup'd at all. With regard to what to do: if you didn't pick up a fix without noticing, yeah, I think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at least, masked) the problem. OK. I think I need at least 4 kernels to try here: GENERIC (which should show the problem), my original DEBUG (which also shows the problem), plus both of those with BREAK_TO_DEBUGGER included to see if that fixes it. Can I just add BREAK_TO_DEBUGGER on its own to a config file? I was wondering if I need to include one of the other debugger options so that it has something to break to. FWIW, I suspect the various reports we are seeing reflect more than one problem, and that they must be relatively edge-case individually, but reports of a few problems have led to more coming out of the woodwork. Obviously, the problems are not edge-case to the people experiencing them... I was thinking that too - I've been guilty of this in the past, lumping my problem in with others under the assumption that it's all the same. This is obviously pretty rare - out of 24 of the HP servers the problem only crops up on 4 of them. But there is nothing different about those 4. I will let you know what my various kernel compiles give me - am building again from scratch, which is slow with WITNESS enabled. -pete.
Re: Big problems with 7.1 locking up :-(
my problem in with others under the assumption that it's all the same. This is obviously pretty rare - out of 24 of the HP servers the problem only crops up on 4 of them. But there is nothing different about those 4. Could it be different BIOS/firmware on the HP servers? Mr. Aliyev was unable to install 7.1-RELEASE on amd64 on a DL380 G5. -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare
Re: Big problems with 7.1 locking up :-(
Pete French wrote: Mine never lock up doing buildworlds either. They only lock up when they are sitting there more or less idle! The machines which have never locked up are the webservers, which are fairly heavily loaded. The machine which locks up the most frequently is a box sitting there doing nothing but DNS, which is the most lightly loaded of the lot. Silly question, but do you have powerd enabled on that server? If so, does disabling it help? Also, do you have any of these in /etc/rc.conf (i.e., with values that are not the same as the defaults in /etc/defaults/rc.conf):
performance_cx_lowest="HIGH"    # Online CPU idle state
performance_cpu_freq="NONE"     # Online CPU frequency
economy_cx_lowest="HIGH"        # Offline CPU idle state
economy_cpu_freq="NONE"         # Offline CPU frequency
Doug -- This .signature sanitized for your protection
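One quick way to answer that question is to list any power-management assignments an rc.conf actually contains. The following is a minimal sketch: the power_overrides helper is hypothetical, and powerd_enable is assumed to be the knob that controls powerd; the other four variable names are the ones quoted in the message.

```shell
#!/bin/sh
# Sketch: print the power-management settings that an rc.conf-style file sets.
# The helper name is an illustrative assumption.

power_overrides() {
    # $1: file to inspect; prints matching assignments, returns non-zero if none
    grep -E '^(powerd_enable|performance_cx_lowest|performance_cpu_freq|economy_cx_lowest|economy_cpu_freq)=' "$1"
}
```

Running power_overrides /etc/rc.conf and comparing the output against power_overrides /etc/defaults/rc.conf would show whether anything differs from the defaults.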
Re: Big problems with 7.1 locking up :-(
Mine never lock up doing buildworlds either. They only lock up when they are sitting there more or less idle! The machines which have never locked up are the webservers, which are fairly heavily loaded. The machine which locks up the most frequently is a box sitting there doing nothing but DNS, which is the most lightly loaded of the lot. The server has been idle for a day now and is still up and running. I then copied a file repeatedly to generate some I/O, and it copies without problems:
for ((a=0; a<10; a++)); do cp netbeans-6.5-ml-macosx.dmg ${a}.dmg; done
I can't (fortunately) make it lock up. I have a DL360 G5 which is unused atm. and can test on it if needed. -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare
Re: Big problems with 7.1 locking up :-(
On Mon, 2009-01-12 at 19:00 +, Pete French wrote: I'm not sure if you've done this already, but the normal suggestions apply: have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do any results / panics / etc result? Sometimes these debugging tools are able to convert hangs into panics, which gives us much more ability to debug them. OK, I have now had a machine hang again, with the correct debug options in the kernel. The screen looked like this when I went to restart it: http://toybox.twisted.org.uk/~pete/71_lor2.png It had not, however, dropped into any kind of debugger. Also, there appear to be console messages after the lock order reversal - is that normal? The machine did stay up for a significant amount of time before doing this. I notice that it is more or less identical to the one I posted when I had WITNESS_KDB in the kernel too, so maybe those results aren't entirely spurious after all? Given it hasn't dropped to a debugger, is there anything else I can try? Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break over the serial line? Gavin
Re: Big problems with 7.1 locking up :-(
Lock order reversals are warnings of potential deadlock due to a lock cycle, but deadlocks may not actually result, either because it's a false positive (some locking construct that is deadlock free but involves lock cycles), or because a cycle didn't actually form. The message is suggestive, but if you have significant system activity after the message, then it may be unrelated. It's hard to tell in this case as there are no timestamps, so I can't see if there is any activity after the lockup. Features like WITNESS and INVARIANTS may change the timing of the kernel, making certain race conditions less likely; I'd run with them for a bit and see if you can reproduce the hang with them present, as they will make debugging the problem a lot easier, if it's possible. Uh, the above *was* me reproducing the hang with them present ;-)) It quite happily hangs with those things in the kernel - indeed the next hang was immediately after I rebooted the machine. But even with WITNESS and INVARIANTS and all the rest it does not drop to a debugger, it simply locks up. That machine is currently turned off, but still has 7.1 installed. What would you like me to try now? I have a lockup I can reproduce pretty reliably now (just wait and it will always lock up). I also found that my other 7.1 box locks up fairly reliably when doing a buildworld. The only difference between these two machines and the ones which don't lock up is that these are serving DNS. The others don't. Note that all the hardware is identical, as is the installed software and the configuration. I am at a total loss... -pete.
Re: Big problems with 7.1 locking up :-(
It was mentioned previously in this thread that CPUTYPE could be an issue. Did you change this if you customized your kernel? Actually, I think that's been ruled out as a possible cause, along with the scheduler. Certainly I have tried it both ways and there is no difference, and I think I saw that the others had too. -pete.
Re: Big problems with 7.1 locking up :-(
On Tue, 13 Jan 2009, Pete French wrote: Features like WITNESS and INVARIANTS may change the timing of the kernel, making certain race conditions less likely; I'd run with them for a bit and see if you can reproduce the hang with them present, as they will make debugging the problem a lot easier, if it's possible. Uh, the above *was* me reproducing the hang with them present ;-)) It quite happily hangs with those things in the kernel - indeed the next hang was immediately after I rebooted the machine. But even with WITNESS and INVARIANTS and all the rest it does not drop to a debugger, it simply locks up. That machine is currently turned off, but still has 7.1 installed. What would you like me to try now? I have a lockup I can reproduce pretty reliably now (just wait and it will always lock up). I also found that my other 7.1 box locks up fairly reliably when doing a buildworld. The only difference between these two machines and the ones which don't lock up is that these are serving DNS. The others don't. Note that all the hardware is identical, as is the installed software and the configuration. If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing ctrl-alt-break on the console to see if you can drop into the debugger, or issue a serial break on a serial console. For reasons that are somewhat complicated to explain, serial breaks are more effective at getting into the debugger, so are preferable -- also because you can more easily log output from the debugger. If you are able to get into the debugger, the normal commands would be most helpful, especially if you can log the results:
ps
show lockedvnods
show alllocks
Robert N M Watson Computer Laboratory University of Cambridge
Re: Big problems with 7.1 locking up :-(
Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break over the serial line? No, Ctrl-Alt-Esc doesn't work, and there is no serial line on the machine (not that I can access, anyway). -pete.
Re: Big problems with 7.1 locking up :-(
Silly question, but do you have powerd enabled on that server? If so, does disabling it help? Also, do you have any of these in /etc/rc.conf (i.e., with values that are not the same as the defaults in /etc/defaults/rc.conf):
performance_cx_lowest="HIGH"    # Online CPU idle state
performance_cpu_freq="NONE"     # Online CPU frequency
economy_cx_lowest="HIGH"        # Offline CPU idle state
economy_cpu_freq="NONE"         # Offline CPU frequency
No, none of those. My rc.conf is below. The only slightly unusual thing I am doing is using lagg rather than the interfaces directly, I guess, but that has worked fine for ages. -pete.
hostname="florentine.rattatosk"
cloned_interfaces="lagg0"
network_interfaces="lo0 bce0 bce1 lagg0"
ifconfig_bce0="up"
ifconfig_bce1="up"
ifconfig_lagg0="laggproto lacp laggport bce0 laggport bce1"
ipv4_addrs_lagg0="10.48.19.0/16 10.48.19.229/16 10.48.19.223/16 10.48.19.243/16 10.48.19.226/16 10.48.19.224/16 10.48.19.227/16 10.48.19.239/16 10.48.19.225/16 10.48.19.230/16 10.48.19.232/16 10.48.19.228/16 10.48.19.235/16 10.48.19.244/16 10.48.19.245/16"
defaultrouter="10.48.0.9"
inetd_enable="YES"
sshd_enable="YES"
dhcpd_enable="YES"
dhcpd_ifaces="lagg0"
dhcpd_flags="-q"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_withumask="022"
nfs_client_enable="YES"
nfs_server_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"
named_enable="YES"
pdns_enable="YES"
pdns_recursor_enable="NO"
mysql_enable="YES"
apache22_http_accept_enable="YES"
apache22_enable="YES"
ntpd_enable="YES"
ntpd_sync_on_start="YES"
exim_enable="YES"
exim_flags="-bd -q10m"
sendmail_enable="NONE"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"
Re: Big problems with 7.1 locking up :-(
I can't (fortunately) make it lock up. I have a DL360 G5 which is unused atm. and can test on it if needed. Would it be possible to install that under amd64 and hammer it with DNS requests? I have been trying to think what the difference might be between my webservers and the machines which are freezing, and the only one I can come up with is UDP traffic, as the locking machines are serving DNS and also NFS. -pete.
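A simple way to "hammer" a test box with DNS requests is a loop around dig(1). The following is only a sketch under assumptions: the hammer_dns helper, the QUERY_CMD override hook, and the default lookup name example.com are all hypothetical, and a serious load test would want many queries in parallel rather than one at a time.

```shell
#!/bin/sh
# Sketch: send a fixed number of DNS queries to one server, one at a time,
# and print how many of them failed.
# QUERY_CMD is a hypothetical override hook; by default dig(1) is used.

hammer_dns() {
    # $1: server address, $2: number of queries, $3: name to look up
    server=$1
    count=$2
    name=${3:-example.com}
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # One try with a one-second timeout, so a hung server shows up quickly.
        ${QUERY_CMD:-dig} "@$server" "$name" +tries=1 +time=1 >/dev/null 2>&1 ||
            failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, hammer_dns 10.48.19.229 10000 would send ten thousand queries to one of the addresses listed in the rc.conf posted earlier in the thread; a sudden run of failures would suggest the box has stopped answering.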
RE: Big problems with 7.1 locking up :-(
I am also experiencing lock-ups on a server recently upgraded from 7.0-RELEASE to 7.1-STABLE. This server is a Supermicro 6022 dual-Xeon box running a GENERIC i386 SMP kernel. Since upgrading to 7.1-STABLE it has started locking up daily. I see symptoms similar to those Pete is seeing - no ping response, no keyboard response, no video output on a very lightly loaded server. I have a test machine with duplicate hardware to the one locking up, on which I just finished installing 7.1-STABLE, but so far it hasn't locked up. Coincidentally, my locking machine is also a DNS server, but I have not enabled DNS on my test machine yet. Since the locking server is remote to me, I need to downgrade it to 7.0 to get it stable again. Once I finish that process, I can provide remote access to the 7.1-STABLE machine in my office if anyone would like to test with it.
Re: Big problems with 7.1 locking up :-(
On Tue, 13 Jan 2009, Pete French wrote: I can't (fortunately) make it lock up. I have a DL360 G5 which is unused atm. and can test on it if needed. Would it be possible to install that under amd64 and hammer it with DNS requests? I have been trying to think what the difference might be between my webservers and the machines which are freezing, and the only one I can come up with is UDP traffic, as the locking machines are serving DNS and also NFS. There are significant changes in UDP locking between 7.0 and 7.1, so it could be that we're looking at a regression there. If you're able to reproduce this reliably, it might well be worth doing a little search-and-replace in udp_usrreq.c along the following lines:
INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT
INP_RLOCK -> INP_WLOCK
INP_RUNLOCK -> INP_WUNLOCK
However, before making these changes for debugging purposes, make sure it's 100% reproducible without them in the configuration so that we don't find ourselves barking up the wrong tree. Normally deadlocks along these lines *do* allow breaking into the debugger from a serial console, but since there are significant changes here in 7.1 it is worth trying to see if this might be related. Robert N M Watson Computer Laboratory University of Cambridge
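The search-and-replace suggested above could be done with a small sed filter. This is a sketch only: the rlock_to_wlock helper name is hypothetical, and it rewrites every occurrence of the three macro names in its input without checking context, so the resulting diff should be reviewed before building.

```shell
#!/bin/sh
# Sketch: rewrite inpcb read-lock macros to their write-lock counterparts.
# Reads C source on stdin and writes the modified source to stdout.

rlock_to_wlock() {
    # INP_RLOCK is also the prefix of INP_RLOCK_ASSERT, so the first
    # expression covers both names; INP_RUNLOCK needs its own expression.
    sed -e 's/INP_RLOCK/INP_WLOCK/g' \
        -e 's/INP_RUNLOCK/INP_WUNLOCK/g'
}
```

Assuming the usual source layout, it might be run as rlock_to_wlock < /usr/src/sys/netinet/udp_usrreq.c > udp_usrreq.c.new, with the new file compared against the original before it is copied into place.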
Re: Big problems with 7.1 locking up :-(
On Mon, 2009-01-12 at 21:35 +0100, Tomas Randa wrote: I have similar problems. The last good kernel I have is from the stable branch as of October 8. After the next upgrade, I saw big performance problems. I tried ULE, 4BSD, etc., but nothing helps except downgrading the system. Now I am trying 7.1-p1 and the problems are back. MySQL spends a lot of time in the status "waiting for opening table" or "waiting for close tables". I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and an Areca SATA controller. Could the problem be in the da device, for example? Thanks Tomas Randa Could you give r186860 a try? It is an MFC into stable/7, so if the machine in question is something you can experiment with, just updating to stable/7 would take care of it. Otherwise, if you could manually apply the patch to a 7.1 source tree and do a test build of the kernel, that would also do it. I'm not experiencing lockups, but this patch helped a lot on a machine I have with a particular disk I/O pattern that resulted in extremely poor performance with 7.1-RELEASE. This patch brought it back to its normal performance level. Thanks. -- Ken Smith | kensm...@cse.buffalo.edu - From there to here, from here to there, funny things are everywhere. - Theodore Geisel
Re: Big problems with 7.1 locking up :-(
I upgraded a test webserver to 7.1 when it was released. After a few days I upgraded a production webserver to 7.1 on Jan. 8th and it has been running without any problems. The webserver is not heavily loaded (load at 2-3 on average). I have made a buildworld -j 8 and it runs fine. If the reported lockup is due to I/O, a buildworld will not be able to reproduce it. It has performed a buildworld without problems and I'll be doing some buildworlds throughout the day. This is on an HP c-class blade with 8 GB RAM, 2 x quad-core and the built-in P200 controller with 64 MB RAM. Forgot to add that CPUTYPE=nocona in /etc/make.conf. -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare
Re: Big problems with 7.1 locking up :-(
I am also surprised that this isn't more widely reported, as the hardware is very common. The only oddity with my compile is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but I will remove it anyway, just so I am actually building a completely vanilla amd64. That way I should have what everyone else has, and since I don't see anyone else saying they have issues, maybe mine will go away too (fingers crossed). Intel suggests nocona for x86_64 platforms and prescott for x86 (i386) based platforms on the GCC 4.2 line, because they best matched the cache size and feature set of the Core2 processors. I don't think that core2 support was fully completed in 4.2 (in fact I believe it was just started), and I don't think that our binutils supports it properly. Some thoughts, -Garrett I upgraded a test webserver to 7.1 when it was released. After a few days I upgraded a production webserver to 7.1 on Jan. 8th and it has been running without any problems. The webserver is not heavily loaded (load at 2-3 on average). I have made a buildworld -j 8 and it runs fine. If the reported lockup is due to I/O, a buildworld will not be able to reproduce it. It has performed a buildworld without problems and I'll be doing some buildworlds throughout the day. This is on an HP c-class blade with 8 GB RAM, 2 x quad-core and the built-in P200 controller with 64 MB RAM. -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare
Re: Big problems with 7.1 locking up :-(
It has performed a buildworld without problems and I'll be doing some buildworlds throughout the day. This is on an HP c-class blade with 8 GB RAM, 2 x quad-core and the built-in P200 controller with 64 MB RAM. I've performed five buildworlds, decrementing -j from 16 to 6, and I can't lock up the server. -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare
Re: Big problems with 7.1 locking up :-(
I've performed five buildworlds, decrementing -j from 16 to 6, and I can't lock up the server. Mine never lock up doing buildworlds either. They only lock up when they are sitting there more or less idle! The machines which have never locked up are the webservers, which are fairly heavily loaded. The machine which locks up the most frequently is a box sitting there doing nothing but DNS, which is the most lightly loaded of the lot. I am going to roll back to 7.0 on all of the HP machines now, having had yet another day of rebooting locked-up machines. I will leave one running 7.1 with the debug options in the kernel to try and get some useful results out of this. All the machines are now running GENERIC with no special optimisations, CPU types or anything like that. Absolutely out-of-the-box vanilla 7.1/amd64 as far as I know :-( -pete.
Re: Big problems with 7.1 locking up :-(
On Fri, 9 Jan 2009, Garance A Drosihn wrote: At 2:39 PM -0500 1/9/09, Robert Blayzor wrote: On Jan 8, 2009, at 8:58 PM, Pete French wrote: I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in it's various incarnations for the last couple of months on our test server and it has performed perfectly. I noticed a problem with 7.0 on a couple of Dell servers. [...] We've since then compiled the kernel under the BSD scheduler to rule that out, and so far so good. Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that? FWIW, the other guy I know who is having this problem had already switched to using ULE under 7.0-release, and did not have any problems with it. So *his* problem was probably not related to SCHED_ULE, unless something has recently changed there. Turns out he hasn't reverted back to 7.0-release just yet, so he's going to try SCHED_4BSD and see if that helps his situation. Scheduler changes always come with some risk of exposing bugs that have existed in the code for a long time but never really manifested themselves. ULE is well shaken-out, having been under development for at least five years, but it is possible that some problems will become visible as a result of the switch. I would encourage people to stick with ULE, but if you're having a stability problem then experimenting with scheduler as a variable that could be triggering the problem may well be useful to help track down the bug. Most of the time the bugs will not be in ULE itself, rather, triggered because ULE will change the ordering or balancing of work in the system, so we should try to avoid situations where people switch to 4BSD from ULE and stick with it rather than getting the underlying problem fixed! 
Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Big problems with 7.1 locking up :-(
On Sat, 10 Jan 2009, Pete French wrote:

>> FWIW, the other guy I know who is having this problem had already switched to using ULE under 7.0-release, and did not have any problems with it. So *his* problem was probably not related to SCHED_ULE, unless something has recently changed there.
>
> Well, one of my machines just locked up again, even with SCHED_4BSD on it, so I am now thinking it is unrelated. The machine has completely locked - no response to pings, no response to keypresses, nor to the power button. There is nothing printed on the console - it is just sitting there with a login prompt :-(
>
> This is really not good - these are extremely common servers after all, and I am just running bog standard 7.1 with apache and mysql. This is happening across several different servers, all of which are slight variants on the DL360, so I don't think it is something peculiar to me.

I'm not sure if you've done this already, but the normal suggestions apply: have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do any results / panics / etc result? Sometimes these debugging tools are able to convert hangs into panics, which gives us much more ability to debug them.

If it still hangs rather than panicking, are you able to break into the debugger on the console? If you're using a video console and not able to get to the debugger, would it be possible to configure a serial console and use that -- serial breaks are often more successful at getting to the debugger than keyboard breaks. Likewise, I'm not sure if this hardware has an NMI button -- some HP servers have one on the motherboard that you can press -- but that is also potentially a way to get into the debugger to analyze the crash.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Big problems with 7.1 locking up :-(
> I'm not sure if you've done this already, but the normal suggestions apply: have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do any results / panics / etc result? Sometimes these debugging tools are able to convert hangs into panics, which gives us much more ability to debug them.

I did, but it turns out I had an incorrect option in there which made the data I got not relevant. I now have another machine running a kernel with the following config:

include GENERIC
ident DEBUG
options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options MUTEX_DEBUG
options WITNESS
options LOCK_PROFILING
options INVARIANTS
options INVARIANT_SUPPORT
options DIAGNOSTIC

Those should enable me to get some useful output, I hope.

> If it still hangs rather than panicking, are you able to break into the debugger on the console? If you're using a video console and not able to get to the debugger, would it be possible to configure a serial console and use

I can't add a serial console - I am remote enough from most of these machines (Slough) and very remote from the test box (it's in the USA!), so I can't get to them physically. But I do have iLO, which lets me use the console and gives me a bit of access to the front. I will check for NMI.

Just had another lockup here - my working day has become a succession of running round rebooting servers through iLO at the moment. Will get back to you when the debug one has crashed - I could possibly give you direct access to the iLO console on that if you need it?

-pete.
Re: Big problems with 7.1 locking up :-(
At 2:55 PM + 1/12/09, Robert Watson wrote: On Fri, 9 Jan 2009, Garance A Drosihn wrote: At 2:39 PM -0500 1/9/09, Robert Blayzor wrote: On Jan 8, 2009, at 8:58 PM, Pete French wrote: I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in it's various incarnations for the last couple of months on our test server and it has performed perfectly. I noticed a problem with 7.0 on a couple of Dell servers. [...] We've since then compiled the kernel under the BSD scheduler to rule that out, and so far so good. Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that? FWIW, the other guy I know who is having this problem had already switched to using ULE under 7.0-release, and did not have any problems with it. So *his* problem was probably not related to SCHED_ULE, unless something has recently changed there. Turns out he hasn't reverted back to 7.0-release just yet, so he's going to try SCHED_4BSD and see if that helps his situation. Scheduler changes always come with some risk of exposing bugs that have existed in the code for a long time but never really manifested themselves. ULE is well shaken-out, having been under development for at least five years, but it is possible that some problems will become visible as a result of the switch. I would encourage people to stick with ULE, but if you're having a stability problem then experimenting with scheduler as a variable that could be triggering the problem may well be useful to help track down the bug. Just to followup on this: My friend did switch back to a 7.1 kernel with SCHED_4BSD, and he still ran into problems. The error messages weren't the same, but errors did happen in the same high disk-I/O situations as the lockup happened with SCHED_ULE. At this point he's fallen back to the 7.0-kernel that he had been running (which also has SCHED_ULE), and all the problems have gone away. 
So at the moment he's running with a 7.0-ish kernel and the 7.1-release userland, without the hanging problems. So the problem is something in the kernel, but it is *NOT* the scheduler (at least, not in his case).

He is not eager to do a whole lot of experiments to track down the problem, since this is happening on busy production machines and he can't afford to have a lot of downtime on them (especially now that the semester at RPI has started up). The systems have some large (2 TB) filesystems on them, and the lockups occur in high disk-I/O situations. He's seeing the problem on one system which is a dual CPU quad-core xeon, and another which is a 64 bit P4 with hyperthreading. The one thing in common between the two setups is that the boot drives + a 3ware controller (with its array of RAID disks) is moved from one machine to the other one:

    it's a 3ware 9500 12 port model, the boot drive is connected to an ICH6 in IDE mode, and yes, I've run it in single, single with hyper threading, and 8 way mode. All 64 bit.

We still have no idea where the problem really is. For all we know, someone spilled a Pepsi on it when he wasn't looking...

-- 
Garance Alistair Drosehn = g...@gilead.netel.rpi.edu
Senior Systems Programmer or g...@freebsd.org
Rensselaer Polytechnic Institute or dro...@rpi.edu
Re: Big problems with 7.1 locking up :-(
> I'm not sure if you've done this already, but the normal suggestions apply: have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do any results / panics / etc result? Sometimes these debugging tools are able to convert hangs into panics, which gives us much more ability to debug them.

OK, I have now had a machine hang again, with the correct debug options in the kernel. The screen looked like this when I went to restart it:

http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear to be console messages after the lock order reversal - is that normal?

The machine did stay up for a significant amount of time before doing this. I notice that it is more or less identical to the one I posted when I had WITNESS_KDB in the kernel too, so maybe those results aren't entirely spurious after all?

Given it hasn't dropped to a debugger, is there anything else I can try?

-pete.
Re: Big problems with 7.1 locking up :-(
> Just to followup on this: My friend did switch back to a 7.1 kernel with SCHED_4BSD, and he still ran into problems. The error messages weren't

Actually, I don't know if I posted it, but that was the same for me too. The scheduler makes no difference, nor do CPU compile settings.

-pete.
Re: Big problems with 7.1 locking up :-(
Hello,

I have similar problems. My last good kernel is from the stable branch as of October 8. After the next upgrade I saw big problems with performance. I tried ULE, 4BSD, etc., but nothing helps except downgrading the system again. Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot of time with status "waiting for opening table" or "waiting for close tables".

I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and an Areca SATA controller. Could the problem be in the da device, for example?

Thanks
Tomas Randa

Garance A Drosihn wrote:
> [...]
Re: Big problems with 7.1 locking up :-(
> I have similar problems. My last good kernel is from the stable branch as of October 8. After the next upgrade I saw big problems with performance. I tried ULE, 4BSD, etc., but nothing helps except downgrading the system again. Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot of time with status "waiting for opening table" or "waiting for close tables".
>
> I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and an Areca SATA controller. Could the problem be in the da device, for example?

It was mentioned previously in this thread that CPUTYPE could be an issue. Did you change this if you customized your kernel?

-- 
regards
Claus

When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner.
Shakespeare
Re: Big problems with 7.1 locking up :-(
On Mon, 12 Jan 2009, Tomas Randa wrote:

> I have similar problems. My last good kernel is from the stable branch as of October 8. After the next upgrade I saw big problems with performance. I tried ULE, 4BSD, etc., but nothing helps except downgrading the system again. Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot of time with status "waiting for opening table" or "waiting for close tables".
>
> I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard and an Areca SATA controller. Could the problem be in the da device, for example?

So far, this sounds like a different problem than the one others have been posting about, which involves full system freezes rather than specific processes wedging or responding poorly. I'd suggest starting by using procstat -k on the process ID to look at where specific threads are waiting in the kernel. Is it simply that MySQL is being unreasonably slow in certain situations, or does it actually entirely stop operating?

If you're able to narrow down the date on the 7.x branch where the problem you're experiencing begins, that would be most helpful. I'd suggest leaving your userspace at 8 October, and sliding the kernel forward in a binary search until you've narrowed it down a bit. Obviously, this takes a bit of patience, but narrowing it down could be quite informative.

Robert N M Watson
Computer Laboratory
University of Cambridge

> Thanks
> Tomas Randa
>
> Garance A Drosihn wrote:
>> [...]
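The binary search Robert describes (hold userland at the last known-good date and move only the kernel) can be sketched as follows. This is only an illustration, not a FreeBSD tool: `bisect_dates` and the `is_bad` callback are hypothetical names, where `is_bad` stands in for "check out the RELENG_7 kernel sources as of that date, build and boot them, and see whether the problem reproduces". The dates are examples taken from this thread, and the sketch assumes a single regression that stays present once introduced.

```python
from datetime import date, timedelta


def bisect_dates(good, bad, is_bad):
    """Return the first date in (good, bad] for which is_bad(date) is true.

    Assumes `good` is known good, `bad` is known bad, and that the
    problem stays present once it has been introduced.
    """
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid    # regression was introduced at or before mid
        else:
            good = mid   # regression was introduced after mid
    return bad


if __name__ == "__main__":
    # Hypothetical stand-in for "build and test a kernel from this date".
    culprit = date(2008, 10, 29)
    first_bad = bisect_dates(date(2008, 10, 8), date(2009, 1, 12),
                             lambda d: d >= culprit)
    print(first_bad)
```

With a window of about three months (8 October to 12 January is 96 days), this needs at most seven kernel builds, which is why the approach is worth the patience.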
Re: Big problems with 7.1 locking up :-(
On Mon, 12 Jan 2009, Pete French wrote:

>> I'm not sure if you've done this already, but the normal suggestions apply: have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do any results / panics / etc result? Sometimes these debugging tools are able to convert hangs into panics, which gives us much more ability to debug them.
>
> OK, I have now had a machine hang again, with the correct debug options in the kernel. The screen looked like this when I went to restart it:
>
> http://toybox.twisted.org.uk/~pete/71_lor2.png
>
> It had not, however, dropped into any kind of debugger. Also there appear to be console messages after the lock order reversal - is that normal?

Lock order reversals are warnings of potential deadlock due to a lock cycle, but deadlocks may not actually result, either because it's a false positive (some locking construct that is deadlock free but involves lock cycles), or because a cycle didn't actually form. The message is suggestive, but if you have significant system activity after the message, then it may be unrelated.

> The machine did stay up for a significant amount of time before doing this. I notice that it is more or less identical to the one I posted when I had WITNESS_KDB in the kernel too, so maybe those results aren't entirely spurious after all?
>
> Given it hasn't dropped to a debugger, is there anything else I can try?

Features like WITNESS and INVARIANTS may change the timing of the kernel, making certain race conditions less likely; I'd run with them for a bit and see if you can reproduce the hang with them present, as they will make debugging the problem a lot easier, if it's possible.

Robert N M Watson
Computer Laboratory
University of Cambridge
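The point that a lock order reversal warns about a potential cycle, rather than proving a deadlock, can be illustrated with a toy checker. This is a simplified sketch and not how WITNESS is implemented: it only records which locks have been taken while others were held, and reports when a new acquisition contradicts an order seen earlier. The class name and the lock names are hypothetical.

```python
from collections import defaultdict


class LockOrderChecker:
    """Toy lock-order tracker (illustrative only; not the WITNESS algorithm)."""

    def __init__(self):
        # after[a] = set of locks that have been acquired while holding a
        self.after = defaultdict(set)

    def _reachable(self, src, dst):
        """True if an established order src -> ... -> dst already exists."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.after[node])
        return False

    def acquire(self, held, new):
        """Record taking `new` while holding `held`; return any warnings."""
        warnings = []
        for h in held:
            # If `new` was previously ordered before `h`, this is a reversal.
            if self._reachable(new, h):
                warnings.append(f"lock order reversal: {h} -> {new}")
            self.after[h].add(new)
        return warnings


if __name__ == "__main__":
    checker = LockOrderChecker()
    # Hypothetical lock names, first taken in one order, then the other.
    print(checker.acquire(["route_table"], "route_entry"))
    print(checker.acquire(["route_entry"], "route_table"))
```

The second call reports a reversal even though nothing blocked: the two acquisitions happened at different times. That matches the behaviour seen here, where the system carried on printing console messages after the warning.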
Re: Big problems with 7.1 locking up :-(
On Mon, 12 Jan 2009, Garance A Drosihn wrote:

> He is not eager to do a whole lot of experiments to track down the problem, since this is happening on busy production machines and he can't afford to have a lot of downtime on them (especially now that the semester at RPI has started up). The systems have some large (2 TB) filesystems on them, and the lockups occur in high disk-I/O situations. He's seeing the problem on one system which is a dual CPU quad-core xeon, and another which is a 64 bit P4 with hyperthreading. The one thing in common between the two setups is that the boot drives + a 3ware controller (with its array of RAID disks) is moved from one machine to the other one:

I think playing the combinatorics game on compile-time flags, kernel features, etc, is probably not the best way to go about debugging this. Instead, I'd debug this as a kernel hang by breaking into the debugger once it occurs, if possible, and ideally on a serial console. Oftentimes hangs can be debugged looking solely at DDB output, or if possible, a crash dump.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Big problems with 7.1 locking up :-(
> I noticed a similar problem testing 7.1-RC1. It seemed to be a deep deadlock, as it was triggered by lighttpd doing kern_sendfile, and never returning. The side effects (being unable to create processes, etc) are similar.

Interesting - did you get any responses from anyone else regarding this? My last box which locked up was essentially idle, so I am very surprised by all of this - also none of the heavily loaded machines (i.e. the actual webservers) have locked up.

I am also surprised that this isn't more widely reported, as the hardware is very common. The only oddity with my compile is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but I will remove it anyway, just so I am actually building a completely vanilla amd64. That way I should have what everyone else has, and since I don't see anyone else saying they have issues then maybe mine will go away too (fingers crossed).

> My kernconf is below, try building the kernel, and send an email containing the backtrace from any process that has blocked (in my

OK, will do. I can try this on the one non-essential box which locked up yesterday. I don't know how long it will be before it locks up again, but will see if I can do some things to provoke it.

-pete.
Re: Big problems with 7.1 locking up :-(
> My kernconf is below, try building the kernel, and send an email containing the backtrace from any process that has blocked (in my

Well, I haven't managed to get a backtrace, but immediately upon booting the system halts with the following:

http://www.twisted.org.uk/~pete/71_lor1.jpg

Interestingly, if I try to boot into safe mode then it will not even get that far:

http://www.twisted.org.uk/~pete/71_safe1.jpg

Am going to try to backtrace that now to see what I can get. Unfortunately I can only provide screen captures rather than actual text output from this, due to having to go via a Mac running RDP through an ssh tunnel to a Windows box and then using IE to go to the iLO :-) Convoluted, but it works...

-pete.
Re: Big problems with 7.1 locking up :-(
On Sun, Jan 11, 2009 at 11:27 AM, Pete French petefre...@ticketswitch.com wrote:

>> My kernconf is below, try building the kernel, and send an email containing the backtrace from any process that has blocked (in my
>
> Well, I haven't managed to get a backtrace, but immediately upon booting the system halts with the following:
>
> http://www.twisted.org.uk/~pete/71_lor1.jpg

Not Found
Re: Big problems with 7.1 locking up :-(
> Not Found

Sorry, see the subsequent email; there are more links there to working PNGs.

-pete.
Re: Big problems with 7.1 locking up :-(
On Sun, Jan 11, 2009 at 4:45 AM, Pete French petefre...@ticketswitch.com wrote:

>> I noticed a similar problem testing 7.1-RC1. It seemed to be a deep deadlock, as it was triggered by lighttpd doing kern_sendfile, and never returning. The side effects (being unable to create processes, etc) are similar.
>
> Interesting - did you get any responses from anyone else regarding this? My last box which locked up was essentially idle, so I am very surprised by all of this - also none of the heavily loaded machines (i.e. the actual webservers) have locked up.
>
> I am also surprised that this isn't more widely reported, as the hardware is very common. The only oddity with my compile is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but I will remove it anyway, just so I am actually building a completely vanilla amd64. That way I should have what everyone else has, and since I don't see anyone else saying they have issues then maybe mine will go away too (fingers crossed).
>
>> My kernconf is below, try building the kernel, and send an email containing the backtrace from any process that has blocked (in my
>
> OK, will do. I can try this on the one non-essential box which locked up yesterday. I don't know how long it will be before it locks up again, but will see if I can do some things to provoke it.
>
> -pete.

Intel suggests nocona for x86_64 platforms and prescott for x86 (i386) based platforms on the GCC 4.2 line, because they best matched the cache size and feature set of the Core2 processors. I don't think that core2 support was fully completed in 4.2 (in fact I believe it was just started), and I don't think that our binutils supports it properly.

Some thoughts,
-Garrett
Re: Big problems with 7.1 locking up :-(
> FWIW, the other guy I know who is having this problem had already switched to using ULE under 7.0-release, and did not have any problems with it. So *his* problem was probably not related to SCHED_ULE, unless something has recently changed there.

Well, one of my machines just locked up again, even with SCHED_4BSD on it, so I am now thinking it is unrelated. The machine has completely locked - no response to pings, no response to keypresses, nor to the power button. There is nothing printed on the console - it is just sitting there with a login prompt :-(

This is really not good - these are extremely common servers after all, and I am just running bog standard 7.1 with apache and mysql. This is happening across several different servers, all of which are slight variants on the DL360, so I don't think it is something peculiar to me.

-pete.
Re: Big problems with 7.1 locking up :-(
I noticed a similar problem testing 7.1-RC1. It seemed to be a deep deadlock, as it was triggered by lighttpd doing kern_sendfile, and never returning. The side effects (being unable to create processes, etc) are similar.

My kernconf is below. Try building the kernel, and send an email containing the backtrace from any process that has blocked (in my case, lighttpd attempting to sendfile a large amount of data to php fastcgi triggered it, but that's a guess on my part). Note that this includes witness and invariants, so performance will be hit.

Also, enable watchdogd, and add -e 'ls -al /etc' to its flags. It should drop you to a debugger within a few seconds of the lock being triggered, and it should output a backtrace and any invariant/witness lock warnings. Obviously if you don't have a serial or local console, don't do this.

include GENERIC
ident DEBUG
options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options INVARIANTS
options WITNESS

On 1/10/09, Pete French petefre...@ticketswitch.com wrote:

>> FWIW, the other guy I know who is having this problem had already switched to using ULE under 7.0-release, and did not have any problems with it. So *his* problem was probably not related to SCHED_ULE, unless something has recently changed there.
>
> Well, one of my machines just locked up again, even with SCHED_4BSD on it, so I am now thinking it is unrelated. The machine has completely locked - no response to pings, no response to keypresses, nor to the power button. There is nothing printed on the console - it is just sitting there with a login prompt :-(
>
> This is really not good - these are extremely common servers after all, and I am just running bog standard 7.1 with apache and mysql. This is happening across several different servers, all of which are slight variants on the DL360, so I don't think it is something peculiar to me.
>
> -pete.
Re: Big problems with 7.1 locking up :-(
Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly. So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.
Since then I have started seeing machines lock up. This always happens under heavy disc load. When I bring the machine back up then sometimes it fails to fsck due to a partially truncated inode. The lockups appear to be disc related - on my mysql master machine it will come back up with files somewhat shorter than those which have already been transmitted to the slave (i.e. some data was in memory, and claimed to have been written to the drive, but never made it onto the disc).
The only time I have seen anything useful on the screen was during one lockup where I got a message about a spin lock being held too long and some comment in parentheses about it being a turnstile lock.
Help! :-( I am now downgrading all the machines to 7.0 as fast as I can - though the machine I am trying to compile it on has locked up once during the compile so I haven't got anywhere so far.
The machines are HP Proliant DL360 G5s - they have an embedded P400i RAID controller with a pair of mirrored drives connected. Each one has both ethernets connected, bundled using lagg and LACP.
I can't tell whether my situation is related, but I am seeing lockups on SMP Supermicro servers with both older (NetBurst-ish) and current Xeon CPUs. I have been dropping into the kernel debugger and getting lock information and process backtraces, but so far nothing has been conclusively identified. I think the issue I'm seeing was introduced sometime between October 2 and November 24 in the RELENG_7 branch, and I suppose the next step is to do a binary search for the offending change.
Guy
--
Guy Helmer, Ph.D.
Chief System Architect Palisade Systems, Inc.
Re: Big problems with 7.1 locking up :-(
At 09:49 AM 1/9/2009, Guy Helmer wrote:
RAID controller with a pair of mirrored drives connected. Each one has both ethernets connected, bundled using lagg and LACP.
I can't tell whether my situation is related, but I am seeing lockups on SMP Supermicro servers with both older (NetBurst-ish) and current Xeon CPUs. I have been dropping into the kernel debugger and getting lock information and process backtraces, but so far nothing has been conclusively identified. I think the issue I'm seeing was introduced sometime between October 2 and November 24 in the RELENG_7 branch, and I suppose the next step is to do a binary search for the offending change.
Are you using the same disk controller as Peter? Do both of you run with quotas on the file system? By lockup, do you mean it doesn't respond to the network either, or just anything that needs disk IO?
---Mike
Re: Big problems with 7.1 locking up :-(
Are you using the same disk controller as Peter? Do both of you run with quotas on the file system? By lockup, do you mean it doesn't respond to the network either, or just anything that needs disk IO?
I don't think he can be using the same controller, as mine is an embedded HP unit. They do make a separate plugin one though - the P400 SAS controller.
My symptoms are that the thing locks hard and responds to nothing, no keypresses or anything. I am assuming that the disc is the first thing to go though, because I see data which was being written to a file and a process reading from that file to the network. More of the file comes over the network than makes it physically onto the disc.
The only useful error I ever saw was the message about spin lock / turnstile locks being held for too long.
-pete.
Re: Big problems with 7.1 locking up :-(
Pete French wrote:
Are you using the same disk controller as Peter? Do both of you run with quotas on the file system? By lockup, do you mean it doesn't respond to the network either, or just anything that needs disk IO?
I don't think he can be using the same controller, as mine is an embedded HP unit. They do make a separate plugin one though - the P400 SAS controller.
My symptoms are that the thing locks hard and responds to nothing, no keypresses or anything. I am assuming that the disc is the first thing to go though, because I see data which was being written to a file and a process reading from that file to the network. More of the file comes over the network than makes it physically onto the disc.
The only useful error I ever saw was the message about spin lock / turnstile locks being held for too long.
-pete.
OK, perhaps my issue is different then. My symptoms seem to be a hang from anything that triggers a fork(), such as entering a command at a shell prompt or entering a user name at the console's login prompt. Network activity still works -- all the TCP connections stay up until I drop into the kernel debugger or power cycle.
Guy
--
Guy Helmer, Ph.D.
Chief System Architect Palisade Systems, Inc.
Re: Big problems with 7.1 locking up :-(
On Jan 8, 2009, at 8:58 PM, Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly.
I noticed a problem with 7.0 on a couple of Dell servers. Not sure if this is related, but when our system froze the box was pingable, and you could switch virtual consoles... however, you could not type anything on the screen or connect to any sockets. Num-lock would still work, so the box wasn't solidly frozen. This used to happen a couple of times every week or two. We've since then compiled the kernel with the BSD scheduler to rule that out, and so far so good. (Our box was a Dell PE1750, 2GB of RAM, amr RAID controller, bge network driver.) The primary application was just ntpd and apache with mpm_worker threads.
Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?
--
Robert Blayzor, BOFH INOC, LLC rblay...@inoc.net http://www.inoc.net/~rblayzor/
Re: Big problems with 7.1 locking up :-(
Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?
Actually you might be on to something there. One of the main differences between our test DL360 and the live ones is that the test one has fewer cores in it, and is under less load. So multiprocessing problems may well show up on the live machines where they won't on the test box.
I shall try building a kernel with the BSD scheduler and see what happens there. Probably not today, as I am loath to cause any more downtime right now.
thanks,
-pete.
Re: Big problems with 7.1 locking up :-(
At 1:58 AM + 1/9/09, Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly. So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.
Since then I have started seeing machines lock up. This always happens under heavy disc load. When I bring the machine back up then sometimes it fails to fsck due to a partially truncated inode. The lockups appear to be disc related [...]
One of my friends is also having trouble with lockups on two machines he had upgraded to 7.1. It also seems to be related to heavy disk I/O, although I'm not sure the symptoms are the same as what you report. Both machines had been running 7.0-release without trouble. On at least one of the systems, he's also working with (what I consider) very large file systems (over 2 TB). Both machines are using a 3ware controller with its RAID.
I realize that isn't much to go on, but it suggests that there is some problem wider than just your (Pete's) usage. I think his situation is such that lockups like this are simply not acceptable, and the last I heard he was reverting back to 7.0-release.
--
Garance Alistair Drosehn = g...@gilead.netel.rpi.edu
Senior Systems Programmer or g...@freebsd.org
Rensselaer Polytechnic Institute or dro...@rpi.edu