Re: Various route locking fixes merged to stable/7 (was: Re: Big problems with 7.1 locking up :-()

2009-03-08 Thread Ruben van Staveren

Hi,
On 26 Feb 2009, at 2:28, Charles Sprickman wrote:


On Wed, 25 Feb 2009, Robert Watson wrote:

Just a minor heads up: I've merged both Kip Macy's lock order fixes  
to the kernel routing code, and the route locking and reference  
counting fixes from kern/130652 to stable/7.  These fixes should  
correct a number of reported network-related hangs.  We might want  
to release a subset of these as an errata patch to 7.1 if they  
shake out well in 7-stable.


+1

Charles


Unfortunately these changes cause my system to panic during early boot,  
around the time when ppp/routing is started. Backing out these changes  
prevents the panic.


I've filed this with some textdumps as: kern/132404: panic sleeping  
thread after 25th Feb src/sys/net commits


http://www.freebsd.org/cgi/query-pr.cgi?pr=132404

Regards,
Ruben
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Big problems with 7.1 locking up :-(

2009-02-25 Thread Pete French
 FYI, I'm currently awaiting testing results from Pete on the MFC of a number 
 of routing table locking fixes, and once that's merged (hopefully tomorrow?) 
 I'll start on the patches in the above PR.  I've taken a crash-course in 
 routing table locking in the last few days... :-)

Just to let you know that I have had zero crashes since I put the patch
live on Sunday. Of course that's only three days, but it does look
very much like it has fixed it. I am also running with the other
routing table patch too.

At this point no news is good news, as it is just sitting there
ticking away nicely to itself. I will roll it out to a few more
machines over the next few days.

But looking good so far, I would encourage other people to try
the patches if they are having problems...

-pete.


Re: Big problems with 7.1 locking up :-(

2009-02-25 Thread Robert Watson

On Wed, 25 Feb 2009, Pete French wrote:

FYI, I'm currently awaiting testing results from Pete on the MFC of a 
number of routing table locking fixes, and once that's merged (hopefully 
tomorrow?) I'll start on the patches in the above PR.  I've taken a 
crash-course in routing table locking in the last few days... :-)


Just to let you know that I have had zero crashes since I put the patch live 
on Sunday. Of course that's only three days, but it does look very much like 
it has fixed it. I am also running with the other routing table patch too..


At this point no news is good news, as it is just sitting there ticking away 
nicely to itself. I will roll it out to a few more machines over the next 
few days.


But looking good so far, I would encourage other people to try the patches if 
they are having problems...


Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I can 
look at the PR and get that in-progress.  Since the code affected by the PR is 
no longer in 8.x, I'll merge directly to 7.x, and probably fairly quickly 
since you've had it in production for a while.


Robert N M Watson
Computer Laboratory
University of Cambridge


Various route locking fixes merged to stable/7 (was: Re: Big problems with 7.1 locking up :-()

2009-02-25 Thread Robert Watson


Just a minor heads up: I've merged both Kip Macy's lock order fixes to the 
kernel routing code, and the route locking and reference counting fixes from 
kern/130652 to stable/7.  These fixes should correct a number of reported 
network-related hangs.  We might want to release a subset of these as an 
errata patch to 7.1 if they shake out well in 7-stable.


Thanks again, especially, to Pete for his evaluation of bugs and patches, Kip 
for his fixes in head, and to Dmitrij Tejblum for his submission of the fixes 
in the above-mentioned PR.


Robert N M Watson
Computer Laboratory
University of Cambridge

On Wed, 25 Feb 2009, Robert Watson wrote:


On Wed, 25 Feb 2009, Pete French wrote:

FYI, I'm currently awaiting testing results from Pete on the MFC of a 
number of routing table locking fixes, and once that's merged (hopefully 
tomorrow?) I'll start on the patches in the above PR.  I've taken a 
crash-course in routing table locking in the last few days... :-)


Just to let you know that I have had zero crashes since I put the patch 
live on Sunday. Of course that's only three days, but it does look very much 
like it has fixed it. I am also running with the other routing table patch 
too..


At this point no news is good news, as it is just sitting there ticking 
away nicely to itself. I will roll it out to a few more machines over the 
next few days.


But looking good so far, I would encourage other people to try the patches 
if they are having problems...


Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I can 
look at the PR and get that in-progress.  Since the code affected by the PR 
is no longer in 8.x, I'll merge directly to 7.x, and probably fairly quickly 
since you've had it in production for a while.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-02-25 Thread cpghost
On Wed, Feb 25, 2009 at 11:04:29AM +, Robert Watson wrote:
 On Wed, 25 Feb 2009, Pete French wrote:
 
  FYI, I'm currently awaiting testing results from Pete on the MFC of a 
  number of routing table locking fixes, and once that's merged (hopefully 
  tomorrow?) I'll start on the patches in the above PR.  I've taken a 
  crash-course in routing table locking in the last few days... :-)
 
  Just to let you know that I have had zero crashes since I put the
  patch live on Sunday. Of course that's only three days, but it does
  look very much like it has fixed it. I am also running with the
  other routing table patch too..  At this point no news is good
  news, as it is just sitting there ticking away nicely to itself. I
  will roll it out to a few more machines over the next few days.
  But looking good so far, I would encourage other people to try the
  patches if they are having problems...
  
 Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so
 that I can look at the PR and get that in-progress.  Since the code
 affected by the PR is no longer in 8.x, I'll merge directly to 7.x,
 and probably fairly quickly since you've had it in production for a
 while.

Great! I hope this patch will also fix the mysterious hangs
I've experienced on Soekris routers since nov/dec 2008. Will
try in a few days and report back any further hangs.

 Robert N M Watson
 Computer Laboratory
 University of Cambridge

Regards,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


Re: Various route locking fixes merged to stable/7 (was: Re: Big problems with 7.1 locking up :-()

2009-02-25 Thread Charles Sprickman

On Wed, 25 Feb 2009, Robert Watson wrote:

Just a minor heads up: I've merged both Kip Macy's lock order fixes to the 
kernel routing code, and the route locking and reference counting fixes from 
kern/130652 to stable/7.  These fixes should correct a number of reported 
network-related hangs.  We might want to release a subset of these as an 
errata patch to 7.1 if they shake out well in 7-stable.


+1

Charles



Robert N M Watson
Computer Laboratory
University of Cambridge

On Wed, 25 Feb 2009, Robert Watson wrote:


On Wed, 25 Feb 2009, Pete French wrote:

FYI, I'm currently awaiting testing results from Pete on the MFC of a 
number of routing table locking fixes, and once that's merged (hopefully 
tomorrow?) I'll start on the patches in the above PR.  I've taken a 
crash-course in routing table locking in the last few days... :-)


Just to let you know that I have had zero crashes since I put the patch 
live on Sunday. Of course that's only three days, but it does look very 
much like it has fixed it. I am also running with the other routing table 
patch too..


At this point no news is good news, as it is just sitting there ticking 
away nicely to itself. I will roll it out to a few more machines over the 
next few days.


But looking good so far, I would encourage other people to try the patches 
if they are having problems...


Thanks -- I've gone ahead and merged the patch to 7.x (r189026) so that I 
can look at the PR and get that in-progress.  Since the code affected by 
the PR is no longer in 8.x, I'll merge directly to 7.x, and probably fairly 
quickly since you've had it in production for a while.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-02-24 Thread Robert Watson

On Mon, 23 Feb 2009, aneeth wrote:


http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=


OK, will give this a try, unless anyone else wants any traces from this 
locked machine ? Is there a known way to tickle this bug when I've 
rebooted, to make sure it's fixed ?


We've been having similar issues with a couple of our servers as well (7.0 
and 7.1). However the problem shows up only on quad core machines. The dual 
core machines are running fine.


FYI, I'm currently awaiting testing results from Pete on the MFC of a number 
of routing table locking fixes, and once that's merged (hopefully tomorrow?) 
I'll start on the patches in the above PR.  I've taken a crash-course in 
routing table locking in the last few days... :-)


The patches I sent him are at:

  http://www.watson.org/~robert/freebsd/20090221-route-locking.diff

They do not include the patch from the above PR which I want to handle 
separately as it's a significantly different issue.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-02-23 Thread aneeth


Pete French-2 wrote:
 
 Probably it is your case, try please.

 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat=
 
 OK, will give this a try, unless anyone else wants any traces from
 this locked machine ? Is there a known way to tickle this bug
 when I've rebooted, to make sure it's fixed ?
 
 thanks,
 
 -pete.
 
 

We've been having similar issues with a couple of our servers as well (7.0
and 7.1). However the problem shows up only on quad core machines. The dual
core machines are running fine.


Re: Big problems with 7.1 locking up :-(

2009-02-21 Thread Robert Watson


On Tue, 17 Feb 2009, Mike Tancsa wrote:

Do you have any other details about these issues? Were the fixes ever 
MFC'd?


Earlier today I handed off some patches for Pete to test (attached below), 
which he's running alongside the patches in kern/130652.  When I run with the 
patches, basically an MFC of a subset of Kip's routing improvements in 8.x, I 
can no longer reproduce the lock reversal, which will hopefully mean Pete can 
no longer reproduce the hang.  I plan to merge these in a couple of days once 
(with any luck) he's confirmed that is the case.  We may want to get a subset 
of this patch on the errata note path, if we can get the ICMP redirect fix 
down to a very short patch.


Robert N M Watson
Computer Laboratory
University of Cambridge


Merge r185747, r185774, r185807, r185849, r185964, r185965, r186051,
r186052 from head to stable/7; note that only the locking fixes and
invariants checking are added from r185747, but not the move to an
rwlock which would modify the kernel binary interface, nor the move
to a non-recursible lock, which is still seeing problem reports in
head.  This corrects, among other things, a deadlock that may occur
when processing incoming ICMP redirects.

r185747:

  - convert radix node head lock from mutex to rwlock
  - make radix node head lock not recursive
  - fix LOR in rtexpunge
  - fix LOR in rtredirect

  Reviewed by:  sam

r185774:

  - avoid recursively locking the radix node head lock
  - assert that it is held if RTF_RNH_LOCKED is passed

r185807:

  Fix a bug introduced in r185747: rather than dereferencing an
  uninitialized *rt to something undefined, use the fibnum that came in as
  function argument.

  Found with:   Coverity Prevent(tm)
  CID:  4168

r185849:

  fix a reported panic when adding a route and one hit here when deleting a
  route

  - pass RTF_RNH_LOCKED to rtalloc1_fib in 2 cases where the lock is held
  - make sure the rnh lock is held across rt_setgate and rt_getifa_fib

r185964:

  Pass RTF_RNH_LOCKED to rtalloc1 since the node head is locked, this avoids
  a recursive lock panic on inet6 detach.

  Reviewed by:  kmacy

r185965:

  RTF_RNH_LOCKED needs to be passed in the flags arg not report,
  apologies to thompsa

r186051:

  in6_addroute is called through rnh_addaddr which is always called with the
  radix node head lock held exclusively. Pass RTF_RNH_LOCKED to rtalloc so
  that rtalloc1_fib will not try to re-acquire the lock.

r186052:

  don't acquire lock recursively

All original commits to head were by Kip Macy <kmacy>, except r185964 by
thompsa.

Reviewed by:  bz
Tested by:    Pete French <petefrench at ticketswitch com>
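
The RTF_RNH_LOCKED convention described in the log above (the caller tells the lookup that it already holds the radix node head lock, so the lookup must not take it again) can be sketched in userland C. This is only an illustration under stated assumptions: `demo_lock`, `demo_table`, `demo_lookup`, `demo_add` and `DEMO_RNH_LOCKED` are hypothetical names, and the toy lock merely asserts on recursive acquisition rather than behaving like a kernel mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; these are not the kernel's definitions. */
#define DEMO_RNH_LOCKED 0x1   /* caller says: I already hold the table lock */

/* Toy non-recursive lock: acquiring it twice trips an assertion. */
struct demo_lock { bool held; };

void demo_lock_acquire(struct demo_lock *l) { assert(!l->held); l->held = true; }
void demo_lock_release(struct demo_lock *l) { assert(l->held); l->held = false; }

struct demo_table {
    struct demo_lock lock;
    int lookups;   /* stands in for the protected routing state */
};

/* Lookup that takes the lock only when the caller does not already hold it. */
int demo_lookup(struct demo_table *t, int flags)
{
    bool needlock = !(flags & DEMO_RNH_LOCKED);

    if (needlock)
        demo_lock_acquire(&t->lock);
    else
        assert(t->lock.held);   /* caller promised the lock is held */

    int n = ++t->lookups;

    if (needlock)
        demo_lock_release(&t->lock);
    return n;
}

/* An "add" path that holds the lock across the nested lookup. */
int demo_add(struct demo_table *t)
{
    demo_lock_acquire(&t->lock);
    int n = demo_lookup(t, DEMO_RNH_LOCKED);   /* passing 0 here would recurse */
    demo_lock_release(&t->lock);
    return n;
}
```

Calling `demo_lookup(&t, 0)` from an unlocked context and `demo_add(&t)` from another both leave the lock released afterwards. In the patch itself the "is it really held" check is only compiled in under INVARIANTS; the sketch always checks, which is a simplification.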



Property changes on: sys
___
Modified: svn:mergeinfo
   Merged /head/sys:r185747,185774,185807,185849,185964-185965,186051-186052

Index: sys/netinet/in_rmx.c
===
--- sys/netinet/in_rmx.c    (revision 188767)
+++ sys/netinet/in_rmx.c    (working copy)
@@ -111,7 +111,7 @@
              * ARP entry and delete it if so.
              */
             rt2 = in_rtalloc1((struct sockaddr *)sin, 0,
-                RTF_CLONING, rt->rt_fibnum);
+                RTF_CLONING|RTF_RNH_LOCKED, rt->rt_fibnum);
             if (rt2) {
                 if (rt2->rt_flags & RTF_LLINFO &&
                     rt2->rt_flags & RTF_HOST &&

Property changes on: sys/dev/cxgb
___
Modified: svn:mergeinfo
   Merged 
/head/sys/dev/cxgb:r185747,185774,185807,185849,185964-185965,186051-186052


Property changes on: sys/dev/ath/ath_hal
___
Modified: svn:mergeinfo
   Merged 
/head/sys/dev/ath/ath_hal:r185747,185774,185807,185849,185964-185965,186051-186052

Index: sys/net/route.c
===
--- sys/net/route.c (revision 188767)
+++ sys/net/route.c (working copy)
@@ -277,6 +277,7 @@
     struct rt_addrinfo info;
     u_long nflags;
     int err = 0, msgtype = RTM_MISS;
+    int needlock;
 
     KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
     if (dst->sa_family != AF_INET)   /* Only INET supports > 1 fib now */
@@ -290,7 +291,13 @@
         rtstat.rts_unreach++;
         goto miss2;
     }
-    RADIX_NODE_HEAD_LOCK(rnh);
+    needlock = !(ignflags & RTF_RNH_LOCKED);
+    if (needlock)
+        RADIX_NODE_HEAD_LOCK(rnh);
+#ifdef INVARIANTS
+    else
+        RADIX_NODE_HEAD_LOCK_ASSERT(rnh);
+#endif
     if ((rn = rnh->rnh_matchaddr(dst, rnh)) &&
         (rn->rn_flags & RNF_ROOT) == 0) {
         /*
@@ -343,7 +350,8 @@
             RT_LOCK(newrt);
             RT_ADDREF(newrt);
         }
-        RADIX_NODE_HEAD_UNLOCK(rnh);
+   if 

Re: Big problems with 7.1 locking up :-(

2009-02-18 Thread Robert Watson


On Tue, 17 Feb 2009, Mike Tancsa wrote:


At 05:38 PM 1/29/2009, Robert Watson wrote:


On Fri, 9 Jan 2009, Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0 perfectly 
happily. I have been testing 7.1 in its various incarnations for the last 
couple of months on our test server and it has performed perfectly.


So the last two days I have been round upgrading all our servers, knowing 
that I had run the system stably on identical hardware for some time.


For those following this other than Pete, who I've been in private 
correspondence with: it seems that he is running into two different 
deadlocks in the routing code.  One of them (at least) is triggered by a 
lock order problem relating to the processing of ICMP redirects -- uncommon 
in most configurations, but quite a few on his network, which triggers 
quickly under load.  Kip Macy has corrected at least one (both?) problems 
in head, and plans to MFC the fixes in the near future.  We'll follow up 
further once the fixes are merged, and if any further problems transpire.


Do you have any other details about these issues ? Were the fixes ever MFC'd


Hi Mike, et al,

I gave Kip a ping about MFCing the fixes and he said he would do that, but has 
apparently been preoccupied.  I'm working on an MFC patch currently, but as 
I'm not all that familiar with the routing code, and the bug fixes were mixed 
with feature enhancements in his original commits, it will probably take me a 
bit longer to produce a candidate patch.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-02-17 Thread Mike Tancsa

At 05:38 PM 1/29/2009, Robert Watson wrote:


On Fri, 9 Jan 2009, Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0 
perfectly happily. I have been testing 7.1 in its various 
incarnations for the last couple of months on our test server and 
it has performed perfectly.


So the last two days I have been round upgrading all our servers, 
knowing that I had run the system stably on identical hardware for some time.


For those following this other than Pete, who I've been in private 
correspondence with: it seems that he is running into two different 
deadlocks in the routing code.  One of them (at least) is triggered 
by a lock order problem relating to the processing of ICMP redirects 
-- uncommon in most configurations, but quite a few on his network, 
which triggers quickly under load.  Kip Macy has corrected at least 
one (both?) problems in head, and plans to MFC the fixes in the near 
future.  We'll follow up further once the fixes are merged, and if 
any further problems transpire.


Hi Robert,
Do you have any other details about these issues ? Were the 
fixes ever MFC'd


---Mike 




Re: Big problems with 7.1 locking up :-(

2009-02-15 Thread Stefan Lambrev

Hi,

Just to let you know what's going on with the issue.

I tried kern.hz=100 on GENERIC 7.1, but the soekris started rebooting  
with ethernet only traffic.

I made a custom kernel with RELENG_7 from 13.Feb and:
options CPU_SOEKRIS
options CPU_GEODE

The soekris is quite stable now and I'm unable to freeze it so far :)

On Feb 8, 2009, at 10:28 PM, Mike Tancsa wrote:


At 10:11 AM 2/8/2009, Stefan Lambrev wrote:

Hi all,

In this thread someone mentioned a problem with soekris devices.
I personally have one of those new soekris devices and installed 7.1R
and it is very easy to freeze it.
All that I have to do is to copy a big file over WIFI (atheros) with
speed higher than 1-2MB/s.



Try and copy across the ethernet.   I have several RELENG_7 boxes  
deployed on soekris and Alix boards (same chipset pretty well) and  
have not seen any stability issues.



   ---Mike



--
Best Wishes,
Stefan Lambrev
ICQ# 24134177







Re: Big problems with 7.1 locking up :-(

2009-02-13 Thread Guy Helmer

Guy Helmer wrote:
FWIW, I think I have tracked down the changes just prior to 
7.1-RELEASE that are causing my Supermicro dual Xeon machines to 
wedge.  I did the binary search between 2008-10-02 and 2008-11-24 
without reproducing any lockups, and then I went on to search between 
2008-11-24 and 2009-01-04.  An SMP kernel build from 2008-12-22 
(r186409) sources was stable for over two weeks; a kernel built from 
2008-12-29 (r186590) sources wedged in under 24 hours under moderate 
load.


It appears that the significant changes between r186409 and r186590 
were r186552 (delphij - reverted ATA changes) and r186535/r186534 
(delphij - reverted bce changes).  My machines don't have bce 
interfaces, so I suspect the ATA changes.


Never mind.  I'm stepping back through older kernels and finding that 
the hangs are now occurring in kernels that had seemed to be stable...


Guy



Re: Big problems with 7.1 locking up :-(

2009-02-12 Thread Guy Helmer

Guy Helmer wrote:

Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.

So the last two days I have been round upgrading all our servers, knowing
that I had run the system stably on identical hardware for some time.

Since then I have started seeing machines lock up. This always happens under
heavy disc load. When I bring the machine back up then sometimes it fails
to fsck due to a partially truncated inode. The lockups appear to
be disc related - on my mysql master machine it will come back up with
files somewhat shorter than those which have already been transmitted to
the slave (i.e. some data was in memory, and claimed to have been written
to the drive, but never made it onto the disc).

The only time I have seen anything useful on the screen was during one lockup
where I got a message about a spin lock being held too long and some
comment in parentheses about it being a turnstile lock.

Help! :-(

I am now downgrading all the machines to 7.0 as fast as I can - though the
machine I am trying to compile it on has locked up once during the compile
so I haven't got anywhere so far.

The machines are HP Proliant DL360 G5s - they have an embedded P400i
RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.
  
I can't tell whether my situation is related, but I am seeing lockups 
on SMP Supermicro servers with both older (NetBurst-ish) and current 
Xeon CPUs.  I have been dropping into the kernel debugger and getting 
lock information and process backtraces, but so far nothing has been 
conclusively identified.  I think the issue I'm seeing was introduced 
sometime between October 2 and November 24 in the RELENG_7 branch, and 
I suppose the next step is to do a binary search for the offending 
change.


Guy

FWIW, I think I have tracked down the changes just prior to 7.1-RELEASE 
that are causing my Supermicro dual Xeon machines to wedge.  I did the 
binary search between 2008-10-02 and 2008-11-24 without reproducing any 
lockups, and then I went on to search between 2008-11-24 and 
2009-01-04.  An SMP kernel build from 2008-12-22 (r186409) sources was 
stable for over two weeks; a kernel built from 2008-12-29 (r186590) 
sources wedged in under 24 hours under moderate load.


It appears that the significant changes between r186409 and r186590 were 
r186552 (delphij - reverted ATA changes) and r186535/r186534 (delphij - 
reverted bce changes).  My machines don't have bce interfaces, so I 
suspect the ATA changes.
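
The binary search over revisions described above can be sketched as follows. This is a minimal illustration, not a tool anyone in the thread used: `demo_first_bad` and `demo_is_bad` are hypothetical names, and the predicate's threshold is made up purely so the example has something to find. In practice each "test" is a kernel build plus days of soak time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*demo_is_bad_fn)(long rev);

/*
 * Hypothetical predicate standing in for "build this revision and see
 * whether it wedges". The threshold is invented for the example only.
 */
bool demo_is_bad(long rev)
{
    return rev >= 186552;
}

/*
 * revs[] is sorted ascending, revs[0] is known good and revs[n-1] is known
 * bad (so n >= 2). Returns the index of the first bad revision, assuming
 * every revision after the first bad one is also bad.
 */
size_t demo_first_bad(const long *revs, size_t n, demo_is_bad_fn is_bad)
{
    size_t good = 0, bad = n - 1;

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(revs[mid]))
            bad = mid;    /* culprit is at or before mid */
        else
            good = mid;   /* culprit is after mid */
    }
    return bad;
}
```

The search only narrows correctly if each build can be classified reliably as good or bad. With an intermittent hang, a build that merely has not wedged yet gets marked good and the search converges on the wrong change.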


Any thoughts?

Thanks,
Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.



Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread Pete French
 load.  Kip Macy has corrected at least one (both?) problems in head, and
 plans to MFC the fixes in the near future.  We'll follow up further once
 the fixes are merged, and if any further problems transpire.

Hi, just wondering if we are any closer to having the MFC for this yet, or
if there are any patches I could test ?

cheers,

-pete.


Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread Stefan Lambrev

Hi all,

In this thread someone mentioned a problem with soekris devices.
I personally have one of those new soekris devices and installed 7.1R  
and it is very easy to freeze it.
All that I have to do is to copy a big file over WIFI (atheros) with  
speed higher than 1-2MB/s.
It takes less than 2 minutes to freeze. I wonder if there is some  
improvement in 7.1-stable so I can try it, or if I can help by compiling  
a debug kernel?
But I'm not sure if this is the same problem as it may be just the  
wireless driver in my case.


On Feb 8, 2009, at 3:11 PM, Pete French wrote:

load.  Kip Macy has corrected at least one (both?) problems in  
head, and
plans to MFC the fixes in the near future.  We'll follow up further  
once

the fixes are merged, and if any further problems transpire.


Hi, just wondering if we are any closer to having the MFC for this  
yet, or

if there are any patches I could test ?

cheers,

-pete.



--
Best Wishes,
Stefan Lambrev
ICQ# 24134177







Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread cpghost
On Sun, Feb 08, 2009 at 05:11:02PM +0200, Stefan Lambrev wrote:
 Hi all,
 
 In this thread someone mentioned a problem with soekris devices.
 I personally have one of those new soekris devices and installed 7.1R  
 and it is very easy to freeze it.
 All that I have to do is to copy a big file over WIFI (atheros) with  
 speed higher than 1-2MB/s.
 It takes less than 2 minutes to freeze. I wonder if there is some  
 improvement in 7.1-stable so I can try it, or if I can help by compiling  
 a debug kernel?
 But I'm not sure if this is the same problem as it may be just the  
 wireless driver in my case.

On some net4801s without WIFI, I also experience frequent
freezes after a couple of hours up to 2-5 days... so it's
probably not only ath related.

What's your kern.hz value? In my /boot/loader.conf, it is set
to 100. Could you try it too, and see if you can still freeze
the box (just to rule out some weird timing / interrupt issue)?

 Best Wishes,
 Stefan Lambrev
 ICQ# 24134177

Regards,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread Mike Tancsa

At 10:11 AM 2/8/2009, Stefan Lambrev wrote:

Hi all,

In this thread someone mentioned a problem with soekris devices.
I personally have one of those new soekris devices and installed 7.1R
and it is very easy to freeze it.
All that I have to do is to copy a big file over WIFI (atheros) with
speed higher than 1-2MB/s.



Try and copy across the ethernet.   I have several RELENG_7 boxes 
deployed on soekris and Alix boards (same chipset pretty well) and 
have not seen any stability issues.



---Mike 




Re: Big problems with 7.1 locking up :-(

2009-01-29 Thread Robert Watson


On Fri, 9 Jan 2009, Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0 perfectly 
happily. I have been testing 7.1 in its various incarnations for the last 
couple of months on our test server and it has performed perfectly.


So the last two days I have been round upgrading all our servers, knowing 
that I had run the system stably on identical hardware for some time.


For those following this other than Pete, who I've been in private 
correspondence with: it seems that he is running into two different deadlocks 
in the routing code.  One of them (at least) is triggered by a lock order 
problem relating to the processing of ICMP redirects -- uncommon in most 
configurations, but quite a few on his network, which triggers quickly under 
load.  Kip Macy has corrected at least one (both?) problems in head, and plans 
to MFC the fixes in the near future.  We'll follow up further once the fixes 
are merged, and if any further problems transpire.
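
A lock order problem of the kind mentioned above arises when two code paths take the same pair of locks in opposite orders. A minimal sketch of how such a reversal can be detected, loosely in the spirit of a lock-order checker, is given below. Everything in it is an assumption for illustration: the names `demo_lock`, `demo_thread`, `demo_acquire` and `demo_release` are hypothetical, and the rule "only acquire locks of strictly increasing rank" is a simplification, not a description of how the kernel orders the routing locks.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_HELD 8

/* Hypothetical lock with a fixed rank; lower ranks must be taken first. */
struct demo_lock {
    const char *name;
    int rank;
};

/* Per-thread record of the locks currently held. */
struct demo_thread {
    const struct demo_lock *held[DEMO_MAX_HELD];
    int nheld;
};

/*
 * Record an acquisition. Returns false, without acquiring, if any lock
 * already held has an equal or higher rank, which would be a reversal
 * (or a recursive acquisition) under the assumed ordering rule.
 */
bool demo_acquire(struct demo_thread *td, const struct demo_lock *lk)
{
    for (int i = 0; i < td->nheld; i++)
        if (td->held[i]->rank >= lk->rank)
            return false;
    if (td->nheld == DEMO_MAX_HELD)
        return false;
    td->held[td->nheld++] = lk;
    return true;
}

/* Forget a held lock; release order does not matter for deadlock. */
void demo_release(struct demo_thread *td, const struct demo_lock *lk)
{
    for (int i = 0; i < td->nheld; i++) {
        if (td->held[i] == lk) {
            td->held[i] = td->held[--td->nheld];
            return;
        }
    }
}
```

With a "table" lock of rank 1 and an "entry" lock of rank 2, taking table then entry is accepted, while a path that holds entry and then asks for table is refused. If two threads ran those two paths concurrently with real locks, each could end up holding one lock and waiting forever for the other, which is the hang.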


Robert N M Watson
Computer Laboratory
University of Cambridge



Since then I have started seeing machines lock up. This always happens under
heavy disc load. When I bring the machine back up then sometimes it fails
to fsck due to a partially truncated inode. The lockups appear to
be disc related - on my mysql master machine it will come back up with
files somewhat shorter than those which have already been transmitted to
the slave (i.e. some data was in memory, and claimed to have been written
to the drive, but never made it onto the disc).

The only time I have seen anything useful on the screen was during one lockup
where I got a message about a spin lock being held too long and some
comment in parentheses about it being a turnstile lock.

Help! :-(

I am now downgrading all the machines to 7.0 as fast as I can - though the
machine I am trying to compile it on has locked up once during the compile
so I haven't got anywhere so far.

The machines are HP Proliant DL360 G5s - they have an embedded P400i
RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.

Advice ?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-25 Thread Cian Hughes

Pete,
Have you considered enabling serial console emulation in the BIOS on
the machines?
I have got my iLO cards set up to redirect the serial ports on my HP
servers so that I can ssh into the iLO cards and, by typing Esc-Q,
access what I would otherwise see on a serial console.

Unfortunately I don't have a DL360 to try and reproduce your problem on.

Regards,
Cian.
On 12 Jan 2009, at 15:16, Pete French wrote:


I can't add a serial console - I am remote enough from most of
these machines (Slough) and very remote from the test box (it's in the USA!)
so I can't get to them physically. But I do have iLO which lets me use the
console and gives me a bit of access to the front. I will check for NMI.






Re: Big problems with 7.1 locking up :-(

2009-01-23 Thread Kris Kennaway

Pete Carah wrote:
Well, following up on my own reply earlier, I csup'd releng_7 with a 
date of last dec 1; the result works fine
in the laptop.  I'll reload the eastern soekris tonight and see how it 
does.  If the soekris is fine also then this gives a data point for 
whenever the bad commit(s) happened.


I had apparently made the mistaken assumption that a general release 
should be better debugged than the work-in-progress leading up to it...


I'm sorry FreeBSD has failed to live up to your expectations.  As ever, 
we strive to fix more bugs than we introduce, but in a changing codebase 
this is never possible to guarantee or to always achieve.


Kris


Re: Big problems with 7.1 locking up :-(

2009-01-21 Thread Doug Barton
Pete Carah wrote:
 I have done some (lots of) kernel debugging in the past.  I have several 
 points:
 
 1. I shouldn't *have* to kernel debug for a normal usage of an
 official release.

Um, why not? We certainly put every possible effort into making sure
that releases (and in fact stable branches in general) do not regress,
but it is inevitable that with the extraordinarily wide range of
hardware and workloads that run FreeBSD there will be regressions.
Mark has already commented on the need for wider community review of
release candidates, so I won't go there.

Helping the community understand and hopefully fix the problem when
something goes pear-shaped for you is part of the price of free
software. If you're unwilling (or unable, as you indicate below) to
pay that price, you may just have to find another tool for the job.

 2. One of the soekris boxes is 2800 MILES away, in a remote location,

That's what serial consoles and remote power switches are for. It's
not our fault if you don't have access to those.

 3. I can't afford the time to debug my tools (freebsd is a tool, not an
 experiment, for lots of people, including me...)  I use this laptop at 
 work in a place where I am *not* working on freebsd. (nor am I even allowed
 to at work...)


Doug (who's still a volunteer, last time I checked)

-- 

This .signature sanitized for your protection


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete French
 yes, do ps - threads in state L or LL and RUN are especially interesting,
 trace of pids 28, 27, and threads wich L on locked chan.

Here's the output of alllocks:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

here are the pages of PS:

http://toybox.twisted.org.uk/~pete/71_lock_ps2/

(next time I boot this I will disable http to avoid getting so many)

I can't see any which are in L, LL or RUN state there though. A few RL
and WL towards the end. Traces on 28 and 27 are here:

http://toybox.twisted.org.uk/~pete/71_trace_28.png
http://toybox.twisted.org.uk/~pete/71_trace_27a.png
http://toybox.twisted.org.uk/~pete/71_trace_27b.png

I also did traces on 19 and 16 as (like 28 and 27) they are in a CPU
state, so may be of interest ?

http://toybox.twisted.org.uk/~pete/71_trace_19.png
http://toybox.twisted.org.uk/~pete/71_trace_16.png

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Chagin Dmitry
On Mon, Jan 19, 2009 at 11:39:08AM +, Pete French wrote:
  yes, do ps - threads in state L or LL and RUN are especially interesting,
  trace of pids 28, 27, and threads wich L on locked chan.
 
 heres the output of alllocks,
 
   http://toybox.twisted.org.uk/~pete/71_show_alllocks.png
 
 here are the pages of PS:
 
   http://toybox.twisted.org.uk/~pete/71_lock_ps2/
 
 (next time I boot this I will disable http to avoid getting so many)
 
 I cant see any which are in L, LL or RUN state there though. A few RL
 and WL towards the end. Traces on 28 and 27 are here:
 
   http://toybox.twisted.org.uk/~pete/71_trace_28.png
   http://toybox.twisted.org.uk/~pete/71_trace_27a.png
   http://toybox.twisted.org.uk/~pete/71_trace_27b.png
 
 I also did traces on 19 and 16 as (like 28 and 27) they are in a CPU
 state, so may be of interest ?
 
   http://toybox.twisted.org.uk/~pete/71_trace_19.png
   http://toybox.twisted.org.uk/~pete/71_trace_16.png
 

This is probably your case; please try the patch from:

http://www.freebsd.org/cgi/query-pr.cgi?pr=130652cat=

-- 
Have fun!
chd


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete French
 Probably it is your case, try please.

 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652cat=

OK, will give this a try, unless anyone else wants any traces from
this locked machine ? Is there a known way to tickle this bug
when I've rebooted, to make sure it's fixed ?

thanks,

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete Carah

   Kris writes:
 You and anyone else seeing performance problems should try to work 
 through the advice given here:


   [1]http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES
from configs that worked fine in the past, their systems are very slow
and/or locking up (mine are both) with the stable branch sometime (I noticed
it sometime in December, but it got worse with the release.)  Most were OK
in October; mine (I think) were OK in late November - may narrow things
down?  Two of my systems that lock up have no internal visibility when they
do (Soekris 4801's routing; the only time-intensive things running are
routing (done in irq context) and pflog.  These run with 60+ meg ram free.)
These are complete lockups, though I did manage to get a ps out of my laptop
last night by waiting 20 _minutes_ for it to start (!).  This is not a
generic performance problem.  The laptop had 55 minutes of cpu time in the
softdepflush thread after being up about an hour and 10 mins; this might
give a hint.  I didn't spot LL/RL state threads at the same time because I
didn't know to.  Now I do.  BTW - the same ps showed 8 or so user-space
procs in R state with NO cpu time; the kernel was hogging all of it for over
an hour.
Firefox did indeed trigger this one as someone else noted.  A soekris doing
only routing+nat has no such excuse...  At least PHK was nice enough to note
the watchdog in another thread :-)

-- Pete

References

   1. http://people.freebsd.org/%7Ekris/scaling/Help_my_system_is_slow.pdf


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete French
 Probably it is your case, try please.
 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652cat=

Well, I have been running this for a while now. I still get this:

http://toybox.twisted.org.uk/~pete/71_lor3.png

On the console, but so far the machine has not crashed. Obviously it's
only been an hour or so as yet, but given that it was freezing in
about 5 minutes earlier this morning it does look good. So thanks
for a good patch so far ;-)

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete French
 http://www.freebsd.org/cgi/query-pr.cgi?pr=130652cat=

Looks like I spoke too soon - It just locked up again I am afraid.
Sitting there now at the debug prompt. It does, however, look very
different this time: For example here is 'show alllocks':

http://toybox.twisted.org.uk/~pete/71_alllocks2.png

That shows a lot of locks in UDP - is this the kind of thing you
were worried about Robert ?

When I do a 'ps' there are, this time, a number of processes in the 'LL'
and 'L' state. The images of the 'ps' and the traces of those locked
processes are to be found here:

http://toybox.twisted.org.uk/~pete/71_lock_ps3/

I tried to keep the threads which belong to each process together.

What else can I get out of this lockup ? It looks like the most
promising so far...

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete French
 There are significant changes in UDP locking between 7.0 and 7.1, so it could
 be that we're looking at a regression there.  If you're able to reproduce this
 reliably, it might well be worth doing a little search-and-replace in
 udp_usrreq.c along the following lines:

INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT
INP_RLOCK -> INP_WLOCK
INP_RUNLOCK -> INP_WUNLOCK
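
(For anyone who would rather script that substitution than do it by hand,
here is a minimal Python sketch. It is only an illustration of the three
renames listed above, not a tested patch: the function name to_write_locks
is made up for this example, and it assumes the macros appear in the file
as plain identifiers.)

```python
import re

# Mapping taken from the suggestion quoted above: each read-lock macro is
# swapped for its write-lock counterpart.
REPLACEMENTS = {
    "INP_RLOCK_ASSERT": "INP_WLOCK_ASSERT",
    "INP_RLOCK": "INP_WLOCK",
    "INP_RUNLOCK": "INP_WUNLOCK",
}

# \b stops INP_RLOCK from matching inside INP_RLOCK_ASSERT, because "_"
# counts as a word character. Other macros such as INP_INFO_RLOCK are left
# alone since they do not contain any of the three names.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in REPLACEMENTS) + r")\b"
)


def to_write_locks(source: str) -> str:
    """Return `source` with the three read-lock macros renamed."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1)], source)
```

The function works on a string, so it would be applied to the contents of
udp_usrreq.c after reading the file; reviewing the resulting diff before
building is still advisable, since this changes locking behaviour on
purpose.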

Given that the latest lockup (see other email) has lots of locks in the UDP
code, would you like me to try this next ? The kernel which has just locked
is one using Dmitry's patch from

http://www.freebsd.org/cgi/query-pr.cgi?pr=130652

I am not sure why that would give me different traces during the lockup
though. I was doing a lot more TCP traffic this time, but that shouldn't
interfere with UDP should it ?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Kris Kennaway

Pete Carah wrote:

Kris writes:

You and anyone else seeing performance problems should try to work 
through the advice given here:



   http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf 
http://people.freebsd.org/%7Ekris/scaling/Help_my_system_is_slow.pdf

Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES
from configs that worked fine in the past, their systems are very slow
and/or locking up (mine are both) with the stable branch sometime (I noticed
it sometime in December, but it got worse with the release.)  Most were OK
in October; mine (I think) were OK in late November - may narrow things
down?  Two of my systems that lock up have no internal visibility when they
do (Soekris 4801's routing; the only time-intensive things running are
routing (done in irq context) and pflog.  These run with 60+ meg ram free.)
These are complete lockups, though I did manage to get a ps out of my laptop
last night by waiting 20 _minutes_ for it to start (!).  This is not a
generic performance problem.  The laptop had 55 minutes of cpu time in the
softdepflush thread after being up about an hour and 10 mins; this might
give a hint.  I didn't spot LL/RL state threads at the same time because I
didn't know to.  Now I do.  BTW - the same ps showed 8 or so user-space
procs in R state with NO cpu time; the kernel was hogging all of it for over
an hour.
Firefox did indeed trigger this one as someone else noted.  A soekris doing
only routing+nat has no such excuse...  At least PHK was nice enough to note
the watchdog in another thread :-)


Actually, there have been several apparently different problems reported
in this thread, some of which (including the message I replied to) *are*
generic "my system is slower" problems.


For generic "my system hangs" problems, see the chapter on kernel
debugging in the handbook or follow the (same) advice given by Robert
earlier in the thread.


Kris


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete Carah
I have done some (lots of) kernel debugging in the past.  I have several 
points:

1. I shouldn't *have* to kernel debug for a normal usage of an
official release.

2. One of the soekris boxes is 2800 MILES away, in a remote location,
with no one present that is a skilled (or, indeed, any kind of) programmer.
I usually thought I could trust a release, especially when I had been
using the stable branch updated at about monthly intervals on 3 servers
with no problems.  (actually, I waited a while on 7.0 because .0 releases
are traditionally quirky; in this case 7.0-rel worked fine and 7.1 has
problems.)  (and my servers are still running the *same* compilation of
kernel/world with no problems; the hangs are unique to either the laptop
(which only started doing this badly with a Jan 9 csup) and the Soekris boxes
(which started hangs sometime in December; they clearly don't run X...)
[ I've backed my house source to -stable of 12/1/08 and hope this will help;
I don't have the time to fool around too much, and particularly to kernel
debug something that shouldn't need it.]

I can't even start X at all on this laptop now.  At least I can boot it,
but it isn't much use for work unless it can run X.

3. I can't afford the time to debug my tools (freebsd is a tool, not an
experiment, for lots of people, including me...)  I use this laptop at 
work in a place where I am *not* working on freebsd. (nor am I even allowed
to at work...)

-- Pete



Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Mark Linimon
On Mon, Jan 19, 2009 at 04:59:59PM -0500, Pete Carah wrote:
 I shouldn't *have* to kernel debug for a normal usage of an
 official release.

Agreed, but the problems that people are having do not seem to have
arisen on any of the systems that ran prerelease tests for 7.1.  Although
I'm sure it does not seem that way to you, 7.1R had a very long QA cycle,
and as far as I knew all the showstopper issues had already been addressed
(although I don't officially speak for re@, I'm just an observer.)

With my bugmeister hat on, I'll happily accept suggestions about how
we can get more people involved in testing the prerelease images.  Clearly
the situation we're in right now is not where anyone wants us to be.

mcl


Re: Big problems with 7.1 locking up :-(

2009-01-19 Thread Pete Carah
Well, following up on my own reply earlier, I csup'd releng_7 with a 
date of last dec 1; the result works fine
in the laptop.  I'll reload the eastern soekris tonight and see how it 
does.  If the soekris is fine also then this gives a data point for 
whenever the bad commit(s) happened.
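
(As an aside on narrowing the window between a known-good and a known-bad
checkout date: the following is a rough Python sketch of a date bisection,
not something anyone in the thread has run. The function bisect_dates and
the is_bad callback are hypothetical; in practice is_bad would mean
"check out the sources as of that date, build, boot and see whether it
hangs", which is a manual step. It also assumes that once a build goes bad,
every later build stays bad.)

```python
from datetime import date, timedelta
from typing import Callable


def bisect_dates(good: date, bad: date, is_bad: Callable[[date], bool]) -> date:
    """Return the earliest date whose build misbehaves.

    Assumes the build for `good` works, the build for `bad` fails, and
    every build after the first bad one also fails.
    """
    while (bad - good).days > 1:
        # Try the midpoint of the remaining window.
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid    # the regression is at or before mid
        else:
            good = mid   # the regression is after mid
    return bad
```

With a window of a few weeks this needs only five or six builds, which
matters when each test means reloading a CF card or waiting a day for a
hang.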


I had apparently made the mistaken assumption that a general release 
should be better debugged than the work-in-progress leading up to it...


As I noted before - I'm in the business of *using* computers, not doing 
fbsd kernel work (I actually do linux kernel
(device driver) work in $dayjob, but so far prefer fbsd for general use, 
like routers and servers.)


I need to regen the soekris config here with the 12.01 also; if it 
doesn't hang either then I can hope that someone can
look through commit notes (I certainly don't have the time or internal 
knowledge of 7.x to do so) and try figure out
what may have happened.  My daughter is tired of rebooting the soekris 
that is 2800 miles west of here.


One extra data point: the systems that work OK with the release code 
have Intel chipsets (older - ich3 and ich5).  The laptop is an AMD64 in 
32-bit mode with an ATI chipset and broadcom wireless (hence uses 
project evil, which has its own problems with hangs).  Soekris is Geode 
SC1100 with its own builtin chipset, presumably a mish-mash of things 
from Cyrix, National, and AMD, given the Geode series's history.  It is 
not possible to gen a system or kernel on the soekris; gcc 4.x won't run 
in 128mb of ram with no swap (they run from cf cards; swap is not possible.)
I have to cross-compile them and reload the cf cards externally if 
possible; if not I use nfs (which breaks the system
badly if it hangs during a make install; this happened this past 
weekend, fortunately not on the other coast :-(


-- Pete



Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread Kris Kennaway

Tomas Randa wrote:

Hello,

I have similar problems. The last good kernel I have is from the stable
branch, October the 8th. Then, in the next upgrade, I saw big problems with
performance.

I tried ULE, 4BSD etc, but nothing helps, only downgrading the system back.

Now I am trying 7.1-p1 and the problems are here again. Mysql is waiting a
lot of time with status "waiting for opening table" or "waiting for
close tables".


I have 32bit FreeBSD with PAE, 1x xeon 5420, supermicro motherboard,
areca SATA controller. Could the problem be in the da device, for example?


You and anyone else seeing performance problems should try to work 
through the advice given here:


  http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

Kris



Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread Michel Talon
Tomas Randa wrote:
 Hello,
 
 I have similar problems. The last good kernel I have from stable 
 brach, october the 8. Then in next upgrade, I saw big problems with 
 performance.

I can add a me too here. This is on my desktop, very lightly loaded.
This computer never had a single problem under FreeBSD so i don't suspect
a hardware problem. My previous upgrade was FreeBSD 7.0-STABLE #0: Tue
Jul 22, and worked perfectly fine with exactly the same software
configuration. 
Now i have FreeBSD 7.1-STABLE #0: Mon Jan  5 , and the situation is
disastrous. Freshly after boot the machine seems to work normal, but
after a few days it becomes slower and slower, windows takes seconds to
appear, firefox3 begins to have garbled output, etc. Then i had the
following problem, firefox got stuck in kernel, impossible to kill it by
kill -9. Needless to say i inspected everything, dmesg, xsession-errors,
top, etc. without seeing anything suspicious. So i rebooted, and bingo!
the machine panicked, mentioning firefox. But the panic itself got stuck
and i had to push the reset button, so no dump. After reboot, machine
works OK for two or three days, then problems begin again. I am
convinced there is a big problem in the kernel. For reference, here is
top and dmesg:

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 264M Active, 613M Inact, 485M Wired, 22M Cache, 112M Buf, 116M Free
Swap: 2023M Total, 4K Used, 2023M Free

  PID USERNAME   THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
62965 michel       1  44    0  3532K  1884K CPU1   1   0:00  0.29% top
 2327 root         1  44    0   161M 29228K select 1  30:39  0.00% Xorg
95937 root         1  44    0 24112K 16800K select 1   2:35  0.00% kdm-bin_gr
 3099 root         1   4    0  3304K  1028K select 0   1:30  0.00% moused
 2209 news         1   8    0  3464K  1052K wait   0   0:37  0.00% sh
  884 root         1  44    0  4712K  2028K select 1   0:12  0.00% ntpd
  453 _pflogd      1 -58    0  3380K  1352K bpf    0   0:11  0.00% pflogd
 1634 www          1   4    0  6268K  2656K kqread 0   0:10  0.00% lighttpd
  788 root         1  44    0  3164K  3184K select 0   0:04  0.00% amd
 2206 news         1  44    0 15208K 12160K select 0   0:03  0.00% innd
  879 root         9   4    0  5432K  2460K kqread 1   0:02  0.00% nscd
  955 root         1  44    0  2736K  1216K select 1   0:02  0.00% master
  758 root         1  44    0  3164K  1340K select 1   0:02  0.00% ypbind
...

so no memory problem

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-STABLE #0: Mon Jan  5 14:29:23 CET 2009
mic...@niobe.lpthe.jussieu.fr:/usr/obj/usr/src/sys/NIOBE
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.06GHz (3073.65-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0xf27  Stepping = 7
  
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
  Logical CPUs per core: 2
real memory  = 1610530816 (1535 MB)
avail memory = 1568387072 (1495 MB)
ACPI APIC Table: ASUS   P4PE
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
This module (opensolaris) contains code covered by the
Common Development and Distribution License (CDDL)
see http://opensolaris.org/os/licensing/opensolaris_license/
ioapic0 Version 2.0 irqs 0-23 on motherboard
acpi0: ASUS P4PE on motherboard
acpi0: Overriding SCI Interrupt from IRQ 9 to IRQ 22
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a (3) failed
acpi0: reservation of 10, 5ff0 (3) failed
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0xe408-0xe40b on acpi0
acpi_button0: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
agp0: Intel 82845G host to AGP bridge on hostb0
pcib1: ACPI PCI-PCI bridge at device 1.0 on pci0
pci1: ACPI PCI bus on pcib1
vgapci0: VGA-compatible display port 0xd800-0xd8ff mem 
0xe000-0xefff,0xdf00-0xdf00 irq 16 at device 0.0 on pci1
uhci0: Intel 82801DB (ICH4) USB controller USB-A port 0xb800-0xb81f irq 16 at 
device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: Intel 82801DB (ICH4) USB controller USB-A on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 on usb0
uhub0: 2 ports with 2 removable, self powered
uhci1: Intel 82801DB (ICH4) USB controller USB-B port 0xb400-0xb41f irq 19 at 
device 29.1 on pci0
uhci1: [GIANT-LOCKED]
uhci1: [ITHREAD]
usb1: Intel 82801DB (ICH4) USB controller USB-B on uhci1
usb1: USB revision 1.0
uhub1: Intel 

Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread dick hoogendijk
On Sun, 18 Jan 2009 13:21:17 +0100
Michel Talon ta...@lpthe.jussieu.fr wrote:
 My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked
 perfectly fine with exactly the same software configuration. 
 Now i have FreeBSD 7.1-STABLE #0: Mon Jan5, and the situation is
 disastrous.

Makes you wonder what on earth could have changed that much between
7.0/7.1. Nice upgrade.. This should not happen on the same hardware!

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS sxce snv105 ++
+ All that's really worth doing is what we do for others (Lewis Carrol)


Re: Big problems with 7.1 locking up :-(

2009-01-18 Thread Claus Guttesen
 My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked
 perfectly fine with exactly the same software configuration.
 Now i have FreeBSD 7.1-STABLE #0: Mon Jan5, and the situation is
 disastrous.

 Makes you wonder on on earth could have changed that much between
 7.0/7.1 Nice upgrade.. This should not happen on the same hardware!

There will always be changes when new features/options/enhancements
are introduced. For my part, I have never had any serious trouble with
FreeBSD whatsoever since FreeBSD 5.1/2, when some kernel limits had
to be changed. My problem was solved with the help from this list.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
 If you are able to get into the debugger, the normal commands would be most 
 helpful, especially if you can log the results:

It finally locked up, and ctrl-alt-esc got me into the debugger at
last! is there anything else you want me to get whilst it is
like that aside from:

ps
show lockedvnods
show alllocks

which I can go and capture as screenshots. I can probably sort out console
access to it potentially if that would be useful whilst it is in this
state ?

-pete.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
ps

output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/
there are a lot of processes as this machine runs the same webservices
as the actual webservers, just that nobody connects to them.

show lockedvnods

nothing - there are no locked vnodes

show alllocks

this gives me 'no such command'. There's a whole list of things I
can show, but none of them look like all the locks. What about the locktree
or the lockchain ?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Chagin Dmitry
On Fri, Jan 16, 2009 at 12:35:49PM +, Pete French wrote:
 ps
 
 output from 'ps' is here: http://toybox.twisted.org.uk/~pete/71_lock_ps/
 there are a lot of processes as this machine runes the same webservices
 as the actual webservers, just that nobody connects to them.
 
 show lockedvnods
 
 nothing - there are no locked vnodes
 
 show alllocks
 
 this gives me 'no suich command' theres a whole list of things I
 can show, but none of them look like all the locks. what about the locktree
 or the lockchain ?
 

hi, please type:
show lock 0xff0001254d20
and then show thread 0xXXX where X is 'owner' of previous output.


-- 
Have fun!
chd


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
 hi, please type:
 show lock 0xff0001254d20
 and then show thread 0xXXX where X is 'owner' of previous output.

http://toybox.twisted.org.uk/~pete/71_pdns_lock.png

That's in Power DNS - which is interesting because the one difference
between the boxes that lock and those which don't is that the locking
ones are serving DNS.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Chagin Dmitry
On Fri, Jan 16, 2009 at 01:34:14PM +, Pete French wrote:
  hi, please type:
  show lock 0xff0001254d20
  and then show thread 0xXXX where X is 'owner' of previous 
  output.
 
 http://toybox.twisted.org.uk/~pete/71_pdns_lock.png
 
 That's in Power DNS - which is interesting because the one difference
 between the boxes that lock and those which dont is that the locking
 ones are serving DNS.
 

trace 832

-- 
Have fun!
chd
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Robert Watson


On Fri, 16 Jan 2009, Pete French wrote:

hi, please type: show lock 0xff0001254d20 and then show thread 
0xXXX where X is 'owner' of previous output.


http://toybox.twisted.org.uk/~pete/71_pdns_lock.png

That's in Power DNS - which is interesting because the one difference 
between the boxes that lock and those which dont is that the locking ones 
are serving DNS.


I rather feared as much.  Let's run down the path of "perhaps there's a
problem with the new UDP locking code" for a bit and see where it takes us.
Is it possible to run those boxes with WITNESS -- I believe that the fact that
"show alllocks" is failing is because WITNESS isn't present.  The other thing
we can do is revert UDP to using purely write locks -- the risk there is that
it might change the timing but not actually resolve the bug, so if we can
analyze it a bit using WITNESS first that would be useful.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
 trace 832

http://toybox.twisted.org.uk/~pete/71_trace_832_1.png
http://toybox.twisted.org.uk/~pete/71_trace_832_2.png

-pete.



Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
 I rather feared as much.  Let's run down the path of perhaps there's a 
 problem with the new UDP locking code for a bit and see where it takes us. 
 Is it possible to run those boxes with WITNESS -- I believe that the fact that
 show alllocks is failing is because WITNESS isn't present.

Yes, I can do that. The only reason I wasn't running with WITNESS is that
it didn't lock up when I added the BREAK_TO_DEBUGGER so I was seeing if
a simple GENERIC kernel would lock up when I added that. I will go
back and add WITNESS when you tell me there's nothing more
we can get out of this lock up (recompiling will involve restarting the
machine so I lose the 'broken to debugger' state). Should I add
anything else ? Skip spinlocks ? Invariants ?

 The other thing we can do is revert UDP to using purely write locks -- the
 risk there is that it might change the timing but not actually resolve the
 bug, so if we can analyze it a bit using WITNESS first that would be useful.

Yes, I will run with WITNESS and anything else you might want. Is there
anything else you, or anyone else, wants from this kernel ? It may take
another day to lock up when I've restarted it unfortunately.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Robert Watson


On Fri, 16 Jan 2009, Pete French wrote:

I rather feared as much.  Let's run down the path of perhaps there's a 
problem with the new UDP locking code for a bit and see where it takes us. 
Is it possible to run those boxes with WITNESS -- I believe that the fact 
that show alllocks is failing is because WITNESS isn't present.


Yes, I can do that. The only reason I wasn't running with WITNESS is that it 
didn't lock up when I added the BREAK_TO_DEBUGGER so I was seeing if a 
simple GENERIC kernel would lock up when I added that. I will go back and 
add WITNESS when you tell me there's nothing more we can get out of this lock 
up (recompiling will involve restarting the machine so I lose the 'broken to 
debugger' state). Should I add anything else ? Skip spinlocks ? Invariants ?


The other thing we can do is revert UDP to using purely write locks -- the 
risk there is that it might change the timing but not actually resolve the 
bug, so if we can analyze it a bit using WITNESS first that would be 
useful.


Yes, I will run with WITNESS and anything else you might want. Is there 
anything else you, or anyone else, wants from this kernel ? It may take 
another day to lock up when I've restarted it unfortunately.


If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good. 
WITNESS does a number of things, including tracking (and being judgemental 
about) lock order.  One nice side effect of that tracking is that we keep 
track of a lot more lock state explicitly, so DDB's show alllocks, show 
locks, etc, commands can build on that.  show lockedvnods works without 
WITNESS, though, so your results so far suggest this is likely not related to 
vnode locking.
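
As a rough sketch of the option set suggested above, a debug kernel configuration could be generated along these lines. The config name DEBUG and the helper name write_debug_kernconf are illustrative assumptions, as is layering on GENERIC via the config(8) include directive. KDB, DDB and BREAK_TO_DEBUGGER are included because they are discussed elsewhere in this thread, and INVARIANT_SUPPORT is added because INVARIANTS depends on it.

```shell
#!/bin/sh
# Sketch only: write a debug kernel config layered on top of GENERIC.
# "DEBUG" and the function name are hypothetical, not from the thread.
write_debug_kernconf() {
    out="$1"
    cat > "$out" <<'EOF'
include GENERIC
ident   DEBUG

options KDB                 # kernel debugger framework
options DDB                 # interactive debugger backend
options BREAK_TO_DEBUGGER   # console break enters the debugger
options INVARIANTS          # extra run-time consistency checks
options INVARIANT_SUPPORT   # required by INVARIANTS
options WITNESS             # lock order tracking and checking
options WITNESS_SKIPSPIN    # skip spin mutexes to reduce overhead
EOF
}
```

The file would normally be written into the architecture's conf directory (for example sys/amd64/conf/DEBUG in the source tree) and then built with make buildkernel KERNCONF=DEBUG.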


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
 If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good. 
 WITNESS does a number of things, including tracking (and being judgemental 
 about) lock order.  One nice side effect of that tracking is that we keep 
 track of a lot more lock state explicitly, so DDB's show alllocks, show 
 locks, etc, commands can build on that.  show lockedvnods works without 
 WITNESS, though, so your results so far suggest this is likely not related to 
 vnode locking.

Right, I've gone back to my DEBUG kernel which has a lot of options in it,
including all the above. It has locked almost immediately luckily, so
now I have it sitting at the debugger prompt. The output from 'show alllocks'
is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

Which of these are worth tracing ?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-16 Thread Pete French
Just continuing to look at this with the help of Dimity, and the
output from 'bt' is here:

http://toybox.twisted.org.uk/~pete/71_bt.png

The top bit of that is from my 'show alllocks', the full version
of which is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Pete French
Just an update on this - I tried the various kernels, but now the machine is
not locking up at all. As I haven't actually changed anything then this does
not make me as happy as you might expect. I don't know what to do now - I 
dare not upgrade the machines to an OS that I know locks, but if I can't
make it lock then it is impossible to get any useful debugging info out
of it.

Maybe waiting for 7.2 is the best move...

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Robert Watson


On Thu, 15 Jan 2009, Pete French wrote:

Just an update on this - I tried the various kernels, but now the machine is 
not locking up at all. As I haven't actually changed anything then this does 
not make me as happy as you might expect. I don't know what to do now - I 
dare not upgrade the machines to an OS that I know locks, but if I can't 
make it lock then it is impossible to get any useful debugging info out of it. 
Maybe waiting for 7.2 is the best move...


Well, one slightly pessimistic (or realistic) view says that all software 
contains bugs, it's just a question of whether or not your workload and 
environment trigger those bugs in a noticeable way.


Given the inconsistency of the symptoms, I wouldn't preclude something 
environmental: could it be that it was the bottom, or more likely, top box in 
a rack and that your air conditioning isn't quite as effective there when the 
outside temperature is above/below some threshold?  Alternatively, could it be 
that the workload changed very slightly -- you're doing fewer DNS queries, or 
the network latency to the DNS server changed?


Certainly, whoever gave the advice on checking BIOS revisions is right: you 
can spend a lot of time tracking down a bug to realize that one box has a 
slightly different BIOS rev and therefore does/doesn't suffer from an obscure 
SMI bug.


In any case, if it starts to reproducibly recur, send out mail and we can see 
if we can track it down some more.  BTW, did you establish if the version of 
iLO you have has a remote NMI?  I seem to recall that some do, and being able 
to deliver an NMI is really quite valuable.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Pete French
 Given the inconsistency of the symptoms, I wouldn't preclude something 
 environmental: could it be that it was the bottom, or more likely, top box in 
 a rack and that your air conditioning isn't quite as effective there when the 
 outside temperature is above/below some threshold?

It's a possibility - but the two machines which were exhibiting the fault
are in Slough and Baton Rouge respectively, so under very different climatic
conditions. However, something has changed to make it stop locking up!
The USA one was doing it every couple of hours at the start of the week, and
the UK one wouldn't last more than half an hour at one point.

 Alternatively, could it be that the workload changed very slightly -- you're
 doing fewer DNS queries, or the network latency to the DNS server changed?

Also a possibility - that workload is entirely dependent on customer behaviour
which is an unpredictable beast!

 Certainly, whoever gave the advice on checking BIOS revisions is right: you 
 can spend a lot of time tracking down a bug to realize that one box has a 
 slightly different BIOS rev and therefore does/doesn't suffer from an obscure 
 SMI bug.

Yes, that's next on my list - make sure they are all on the same version.

 In any case, if it starts to reproducibly recur, send out mail and we can see
 if we can track it down some more.  BTW, did you establish if the version of 
 iLO you have has a remote NMI?  I seem to recall that some do, and being able 
 to deliver an NMI is really quite valuable.

OK, thanks. My iLO2 appears to have the ability to generate an NMI on
demand, so that could be used if/when the fault crops up again.

Thanks, will let this lie for now and resurrect the thread when I can
get some more useful data.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Robert Watson


On Thu, 15 Jan 2009, Pete French wrote:

In any case, if it starts to reproducibly recur, send out mail and we can 
see if we can track it down some more.  BTW, did you establish if the 
version of iLO you have has a remote NMI?  I seem to recall that some do, 
and being able to deliver an NMI is really quite valuable.


OK, thanks. My iLO2 appears to have the ability to generate an NMI on 
demand, so that could be used if/when the fault crops up again.


Thanks, will let this lie for now and resurrect the thread when I can get 
some more useful data.


Excellent WRT NMI.  As long as you have DDB, KDB, and BREAK_TO_DEBUGGER 
compiled into the kernel, generating that should reliably get you into the 
debugger.  If it's possible to keep running with INVARIANTS and WITNESS, or 
just INVARIANTS if WITNESS slows things down too much, that would be 
desirable.  You might want to give the NMI a test run just to make sure it 
behaves as you think it should, though -- be aware that if DDB/KDB aren't 
compiled into the kernel, then an NMI will panic the box.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Pete French
 desirable.  You might want to give the NMI a test run just to make sure it 
 behaves as you think it should, though -- be aware that if DDB/KDB aren't 
 compiled into the kernel, then an NMI will panic the box.

Unfortunately it does this...

http://toybox.twisted.org.uk/~pete/71_nmi1.png

That is locked up too - hitting return does nothing. I was hoping it
was just garbled output but had actually gone to the debugger.
Apparently not.

That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER,
which does work as I have tested it with CTRL_ALT_ESC.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread Robert Watson

On Thu, 15 Jan 2009, Pete French wrote:

desirable.  You might want to give the NMI a test run just to make sure it 
behaves as you think it should, though -- be aware that if DDB/KDB aren't 
compiled into the kernel, then an NMI will panic the box.


Unfortunately it does this...

http://toybox.twisted.org.uk/~pete/71_nmi1.png

That is locked up too - hitting return does nothing. I was hoping it was 
just garbled output but had actually gone to the debugger. Apparently not.


That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which 
does work as I have tested it with CTRL_ALT_ESC.


Er, that's rather upsetting.  John, do you have any ideas about this?

Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-15 Thread John Baldwin
On Thursday 15 January 2009 12:49:11 pm Robert Watson wrote:
 On Thu, 15 Jan 2009, Pete French wrote:
 
  desirable.  You might want to give the NMI a test run just to make sure it
  behaves as you think it should, though -- be aware that if DDB/KDB aren't 
  compiled into the kernel, then an NMI will panic the box.
 
  Unfortunately it does this...
 
  http://toybox.twisted.org.uk/~pete/71_nmi1.png
 
  That is locked up too - hitting return does nothing. I was hoping it was 
  just garbled output but had actually gone to the debugger. Apparently not.
 
  That's with a config file containing KDB, DDB and BREAK_TO_DEBUGGER, which 
  does work as I have tested it with CTRL_ALT_ESC.
 
 Er, that's rather upsetting.  John, do you have any ideas about this?

The rest of the thread I have no context on still.  The garbage is due to 
competing panics I think.  The problem is we don't single thread the printf's 
in 'trap_fatal()'.  We should probably have some sort of simple spin lock 
thing in the x86 code to only allow 1 CPU at a time to run through that 
routine.

-- 
John Baldwin


Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Pete French
 If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing 
 ctrl-alt-break on the console to see if you can drop into the debugger, or 
 issue a serial break on a serial console.

Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained
all the other stuff (WITNESS etc...). The end result...

...it no longer crashes :-(

I am not sure what to make of that! What could adding this to the kernel
possibly do which would make my problems go away ? Should I try just
adding this option to my GENERIC kernel and seeing if that also gives me
something stable ?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Robert Watson

On Wed, 14 Jan 2009, Pete French wrote:

If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing 
ctrl-alt-break on the console to see if you can drop into the debugger, or 
issue a serial break on a serial console.


Well, I added BREAK_TO_DEBUGGER to the kernel config I had which contained 
all the other stuff (WITNESS etc...). The end result...


...it no longer crashes :-(

I am not sure what to make of that! What could adding this to the kernel 
possibly do which would make my problems go away ? Should I try just adding 
this option to my GENERIC kernel and seeing if that also gives me something 
stable ?


Yeah, that is unexpected -- the BREAK_TO_DEBUGGER path should have almost no 
effect on control flow, unlike, say, WITNESS, which significantly distorts 
timing.  Is there any chance you picked up any of the recent fixes that went 
into RELENG_7 without noticing, and that perhaps one of those did it?  With 
regard to what to do: if you didn't pick up a fix without noticing, yeah, I 
think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at 
least, masked) the problem.  Generally with this sort of testing one has to be 
pretty rigorous in testing assumptions, because it's easy for changes to sneak 
in.  Particularly annoying are seemingly innocuous code changes that do things 
like slightly rearrange kernel memory.


FWIW, I suspect the various reports we are seeing reflect more than one 
problem, and that they must be relatively edge-case individually but reports 
of a few problems have led to more coming out of the woodwork.  Obviously, 
the problems are not edge-case to the people experiencing them...


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Pete French
 effect on control flow, unlike, say, WITNESS, which significantly distorts 
 timing.  Is there any chance you picked up any of the recent fixes that went 
 into RELENG_7 without noticing, and that perhaps one of those did it?  With 

I'm pretty certain of that - I have just been changing kernel config
files, I haven't actually csup'd at all.

 regard to what to do: if you didn't pick up a fix without noticing, yeah, I 
 think it's worth testing the hypothesis that BREAK_TO_DEBUGGER fixed (or at 
 least, masked) the problem.

OK. I think I need at least 4 kernels to try here: GENERIC (which should
show the problem), my original DEBUG (which also shows the problem) plus
both of those with BREAK_TO_DEBUGGER included to see if that fixes it. Can
I just add BREAK_TO_DEBUGGER on its own to a config file ? I was wondering
if I need to include one of the other debugger options so that it has
something to break to ?

 FWIW, I suspect the various reports we are seeing reflect more than one 
 problem, and that they must be relatively edge-case individually but reports 
 of a few problems have led to more coming out of the woodwork.  Obviously, 
 the problems are not edge-case to the people experiencing them...

I was thinking that too - I've been guilty of this in the past too, lumping
my problem in with others under the assumption that it's all the same. This
is obviously pretty rare - out of 24 of the HP servers the problem only crops
up on 4 of them. But there is nothing different about those 4.

I will let you know what my various kernel compiles give me - am building
again from scratch, which is slow with WITNESS enabled.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-14 Thread Claus Guttesen
 my problem in with others under the assumption that it's all the same. This
 is obviously pretty rare - out of 24 of the HP servers the problem only crops
 up on 4 of them. But there is nothing different about those 4.

Could it be different bios/firmware on the hp-servers?

Mr. Aliyev was unable to install 7.1 release on amd64 on a DL380 G5.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Doug Barton
Pete French wrote:
 Mine never lock up doing buildworlds either. They only lock up when they are
 sitting there more or less idle! The machines which have never locked up
 are the webservers, which are fairly heavily loaded. The machine which locks
 up the most frequently is a box sitting there doing nothing but DNS, which is
 the most lightly loaded of the lot.

Silly question but do you have powerd enabled on that server? If so,
does disabling it help? Also do you have any of these in /etc/rc.conf
(i.e., they are not the same as the default values in
/etc/defaults/rc.conf):
performance_cx_lowest="HIGH"    # Online CPU idle state
performance_cpu_freq="NONE"     # Online CPU frequency
economy_cx_lowest="HIGH"        # Offline CPU idle state
economy_cpu_freq="NONE"         # Offline CPU frequency
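
One way to answer the question above quickly is to list any of these variables that have been set locally. This is only a sketch: the function name show_power_overrides is hypothetical, and it assumes the overrides live in a single rc.conf-style file with one assignment per line.

```shell
#!/bin/sh
# Sketch only: list power-management overrides present in an rc.conf file.
# The function name is hypothetical; the variable names are the ones
# mentioned in the message above.
show_power_overrides() {
    conf="${1:-/etc/rc.conf}"
    # Prints each matching assignment; exits non-zero if none are set.
    grep -E '^(powerd_enable|performance_cx_lowest|performance_cpu_freq|economy_cx_lowest|economy_cpu_freq)=' "$conf"
}
```

Called with no argument it reads /etc/rc.conf. No output (and a non-zero exit status) means none of the settings are overridden in that file.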


Doug

-- 

This .signature sanitized for your protection



Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Claus Guttesen
 Mine never lock up doing buildworlds either. They only lock up when they are
 sitting there more or less idle! The machines which have never locked up
 are the webservers, which are fairly heavily loaded. The machine which locks
 up the most frequently is a box sitting there doing nothing but DNS, which is
 the most lightly loaded of the lot.

The server has been idle for a day now and is up and running. I have
then copied a file to generate some i/o and it copies without
problems.

for ((a=0;a<10;a++))
  do
  cp netbeans-6.5-ml-macosx.dmg ${a}.dmg 
done

I can't (fortunately) make it lock up. I have a DL360 G5 which is
unused atm. and can test on it if needed.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Gavin Atkinson
On Mon, 2009-01-12 at 19:00 +, Pete French wrote:
  I'm not sure if you've done this already, but the normal suggestions apply: 
  have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
  any results / panics / etc result?  Sometimes these debugging tools are able
  to convert hangs into panics, which gives us much more ability to debug them.
 
 OK, I have now had a machine hang again, with the correct debug options in
 the kernel. The screen looked like this when I went to restart it:
 
   http://toybox.twisted.org.uk/~pete/71_lor2.png
 
 It had not, however, dropped into any kind of debugger. Also there appear
 to be console messages after the lock order reversal - is that normal ?
 
 The machine did stay up for a significant amount of time before doing this. I
 notice that it is more or less identical to the one I posted when I
 had WITNESS_KDB in the kernel too, so maybe those results aren't
 entirely spurious after all ?
 
 Given it hasn't dropped to a debugger, is there anything else I can try ? 

Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break
over the serial line?

Gavin


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Pete French
 Lock order reversals are warnings of potential deadlock due to a lock cycle, 
 but deadlocks may not actually result, either because it's a false positive 
 (some locking construct that is deadlock free but involves lock cycles), or 
 because a cycle didn't actually form.  The message is suggestive, but if you 
 have significant system activity after the message, then it may be unrelated.

It's hard to tell in this case as there are no timestamps, so I can't
see if there is any activity after the lockup.

 Features like WITNESS and INVARIANTS may change the timing of the kernel 
 making certain race conditions less likely; I'd run with them for a bit and 
 see if you can reproduce the hang with them present, as they will make 
 debugging the problem a lot easier, if it's possible.

Uh, the above *was* me reproducing the hang with them present ;-)) It
quite happily hangs with those things in the kernel - indeed the next
hang was immediately after I rebooted the machine. But even with WITNESS
and INVARIANTS and all the rest it does not drop to a debugger, it
simply locks up.

That machine is currently turned off, but still has 7.1 installed. What
would you like me to try now ? I have a lockup I can reproduce pretty
reliably now (just wait and it will always lock up). I also found that
my other 7.1 box locks up fairly reliably when doing a buildworld.

The only difference between these two machines and the ones which don't
lock up is that these are serving DNS. The others don't. Note that all
the hardware is identical, as is the installed software and the configuration.

I am at a total loss...

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Pete French
 It was mentioned previously in this thread that CPUTYPE could be an
 issue. Did you change this if you customized your kernel?

Actually, I think that's been ruled out as a possible cause, along
with the scheduler. Certainly I have tried it both ways and
there is no difference, and I think I saw that the others had too.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Robert Watson


On Tue, 13 Jan 2009, Pete French wrote:

Features like WITNESS and INVARIANTS may change the timing of the kernel 
making certain race conditions less likely; I'd run with them for a bit and 
see if you can reproduce the hang with them present, as they will make 
debugging the problem a lot easier, if it's possible.


Uh, the above *was* me reproducing the hang with them present ;-)) It quite 
happily hangs with those things in the kernel - indeed the next hang was 
immediately after I rebooted the machine. But even with WITNESS and 
INVARIANTS and all the rest it does not drop to a debugger, it simply locks 
up.


That machine is currently turned off, but still has 7.1 installed. What 
would you like me to try now ? I have a lockup I can reproduce pretty 
reliably now (just wait and it will always lock up). I also found that my 
other 7.1 box locks up fairly reliably when doing a buildworld.


The only difference between these two machines and the ones which don't lock 
up is that these are serving DNS. The others don't. Note that all the 
hardware is identical, as is the installed software and the configuration.


If you have BREAK_TO_DEBUGGER compiled into the kernel, then try pressing 
ctrl-alt-break on the console to see if you can drop into the debugger, or 
issue a serial break on a serial console.  For reasons that are somewhat 
complicated to explain, serial breaks are more effective at getting into the 
debugger, so are preferable -- also because you can more easily log output 
from the debugger.


If you are able to get into the debugger, the normal commands would be most 
helpful, especially if you can log the results:


  ps
  show lockedvnods
  show alllocks

Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Pete French
 Can you break into the debugger with Ctrl-Alt-Esc, or by sending a break
 over the serial line?

No, ctrl-alt-esc doesn't work, and there is no serial line on the machine (not
that I can access anyway)

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Pete French
 Silly question but do you have powerd enabled on that server? If so,
 does disabling it help? Also do you have any of these in /etc/rc.conf
 (i.e., they are not the same as the default values in
 /etc/defaults/rc.conf):
 performance_cx_lowest="HIGH"    # Online CPU idle state
 performance_cpu_freq="NONE"     # Online CPU frequency
 economy_cx_lowest="HIGH"        # Offline CPU idle state
 economy_cpu_freq="NONE"         # Offline CPU frequency

No, none of those. My rc.conf is below. The only slightly unusual thing I
am doing is using lagg rather than the interfaces directly I guess, but
that has worked fine for ages.

-pete.


hostname="florentine.rattatosk"
cloned_interfaces="lagg0"
network_interfaces="lo0 bce0 bce1 lagg0"
ifconfig_bce0="up"
ifconfig_bce1="up"
ifconfig_lagg0="laggproto lacp laggport bce0 laggport bce1"

ipv4_addrs_lagg0="10.48.19.0/16 10.48.19.229/16 10.48.19.223/16 10.48.19.243/16 10.48.19.226/16 10.48.19.224/16 10.48.19.227/16 10.48.19.239/16 10.48.19.225/16 10.48.19.230/16 10.48.19.232/16 10.48.19.228/16 10.48.19.235/16 10.48.19.244/16 10.48.19.245/16"

defaultrouter="10.48.0.9"

inetd_enable="YES"
sshd_enable="YES"

dhcpd_enable="YES"
dhcpd_ifaces="lagg0"
dhcpd_flags="-q"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_withumask="022"

nfs_client_enable="YES"
nfs_server_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"

named_enable="YES"
pdns_enable="YES"
pdns_recursor_enable="NO"

mysql_enable="YES"

apache22_http_accept_enable="YES"
apache22_enable="YES"

ntpd_enable="YES"
ntpd_sync_on_start="YES"

exim_enable="YES"
exim_flags="-bd -q10m"
sendmail_enable="NONE"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Pete French
 I can't  (fortunately) make it lock up. I have a DL360 G5 which is
 unused atm. and can test on it if needed.

Would it be possible to install that under amd64 and hammer it with
DNS requests ? I have been trying to think what the difference might be
between my webservers and the machines which are freezing, and the only
one I can come up with is UDP traffic as the locking machines are serving
DNS and also NFS.

-pete.


RE: Big problems with 7.1 locking up :-(

2009-01-13 Thread Nathan Way
I also am experiencing lock-ups on a server recently upgraded from
7.0-RELEASE to 7.1-STABLE.  This server is a Supermicro 6022 dual-Xeon
box running a GENERIC i386 SMP kernel.  Since upgrading to 7.1-STABLE it
has started locking up daily.  I see symptoms similar to those Pete is
seeing - no ping response, no keyboard response, no video output on a
very lightly loaded server.

I have a test machine with duplicate hardware to the one locking up that
I just finished installing 7.1-STABLE on but so far it hasn't locked up.
Coincidentally my locking machine is also a DNS server but I have not
enabled DNS on my test machine yet.

Since the locking server is remote to me, I need to downgrade it to 7.0
to get it stable again.  Once I finish that process, I can provide
remote access to the 7.1-STABLE machine in my office if anyone would
like to test with it.



Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Robert Watson


On Tue, 13 Jan 2009, Pete French wrote:

I can't (fortunately) make it lock up. I have a DL360 G5 which is unused 
atm. and can test on it if needed.


Would it be possible to install that under amd64 and hammer it with DNS 
requests ? I have been trying to think what the difference might be between 
my webservers and the machines which are freezing, and the only one I can 
come up with is UDP traffic as the locking machines are serving DNS and also 
NFS.


There are significant changes in UDP locking between 7.0 and 7.1, so it could 
be that we're looking at a regression there.  If you're able to reproduce this 
reliably, it might well be worth doing a little search-and-replace in 
udp_usrreq.c along the following lines:


  INP_RLOCK_ASSERT -> INP_WLOCK_ASSERT
  INP_RLOCK -> INP_WLOCK
  INP_RUNLOCK -> INP_WUNLOCK

However, before making these changes for debugging purposes, make sure it's 
100% reproducible without them in the configuration so that we don't find 
ourselves barking up the wrong tree.  Normally deadlocks along these lines 
*do* allow breaking into the debugger from a serial console, but since there 
are significant changes here in 7.1 it is worth trying to see if this might be 
related.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-13 Thread Ken Smith
On Mon, 2009-01-12 at 21:35 +0100, Tomas Randa wrote:
 I have similar problems. The last good kernel I have is from the stable 
 branch, October 8. After the next upgrade, I saw big problems with 
 performance.
 I tried ULE, 4BSD, etc., but nothing helps except downgrading the system.
 
 Now I am trying 7.1-p1 and the problems are here again. MySQL spends a 
 lot of time in the states "waiting for opening table" or "waiting for 
 close tables".
 
 I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and 
 an Areca SATA controller. Could the problem be in the da device, for example?
 
 Thanks Tomas Randa

Could you give r186860 a try?  It is an MFC into stable/7 so if the
machine in question is something you can experiment with just updating
to stable/7 would take care of it.  Otherwise if you could just manually
apply the patch to a 7.1 source tree and do a test build of the kernel
that would also do it.

I'm not experiencing lockups but this patch helped a lot on a machine I
have with a particular disk I/O pattern that resulted in extremely poor
performance with 7.1-RELEASE.  This patch brought it back to its normal
performance level.

Thanks.

-- 
Ken Smith
- From there to here, from here to      |   kensm...@cse.buffalo.edu
  there, funny things are everywhere.   |
                      - Theodore Geisel |





Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Claus Guttesen
 I upgraded a test webserver to 7.1 when it was released. After a
 few days I upgraded a production webserver to 7.1 on Jan. 8th and it
 has been running without any problems. The webserver is not heavily
 loaded (load at 2-3 on average). I have run a buildworld -j 8 and it
 runs fine.

 If the reported lockup is due to I/O, a buildworld will not be able to
 reproduce it.

 It has performed a buildworld without problems and I'll be doing some
 buildworlds throughout the day.

 This is on an HP c-Class blade with 8 GB RAM, 2 x quad-core and the
 built-in P200 controller with 64 MB RAM.

Forgot to add that CPUTYPE=nocona in /etc/make.conf.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Claus Guttesen
 I am also surprised that this isn't more widely reported, as
 the hardware is very common. The only oddity with my compile
 is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
 I will remove it anyway, just so I am actually building a completely
 vanilla amd64. That way I should have what everyone else has, and since
 I don't see anyone else saying they have issues then maybe mine will
 go away too (fingers crossed)

 Intel suggests nocona for x86_64 platforms and prescott for x86
 (i386) based platforms on the 4.2 line, because they best matched the
 cache size and feature set of the Core2 processors.

 I don't think that core2 support was fully completed in 4.2 (in
 fact I believe it was just started), and I don't think that our
 binutils supports it properly.

 Some thoughts,
 -Garrett

I upgraded a test webserver to 7.1 when it was released. After a
few days I upgraded a production webserver to 7.1 on Jan. 8th and it
has been running without any problems. The webserver is not heavily
loaded (load at 2-3 on average). I have run a buildworld -j 8 and it
runs fine.

If the reported lockup is due to I/O, a buildworld will not be able to
reproduce it.

It has performed a buildworld without problems and I'll be doing some
buildworlds throughout the day.

This is on an HP c-Class blade with 8 GB RAM, 2 x quad-core and the
built-in P200 controller with 64 MB RAM.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Claus Guttesen
 It has performed a buildworld without problems and I'll be doing some
 buildworlds throughout the day.

 This is on an HP c-Class blade with 8 GB RAM, 2 x quad-core and the
 built-in P200 controller with 64 MB RAM.

I've performed five buildworlds decrementing -j from 16 to 6 and I
can't lock up the server.

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Pete French
 I've performed five buildworlds decrementing -j from 16 to 6 and I
 can't lock up the server.

Mine never lock up doing buildworlds either. They only lock up when they are
sitting there more or less idle! The machines which have never locked up
are the webservers, which are fairly heavily loaded. The machine which locks
up the most frequently is a box sitting there doing nothing but DNS, which is
the most lightly loaded of the lot.

I am going to roll back to 7.0 on all of the HP machines now, having
had yet another day of rebooting locked up machines. I will leave one
running 7.1 with the debug options in the kernel to try and get some
useful results out of this. All the machines are now running GENERIC with
no special optimisations, CPU types or anything like that. Absolutely out
of the box vanilla 7.1/amd64 as far as I know :-(

-pete.



Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Robert Watson


On Fri, 9 Jan 2009, Garance A Drosihn wrote:


At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:

On Jan 8, 2009, at 8:58 PM, Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 perfectly 
happily. I have been testing 7.1 in its various incarnations for the last 
couple of months on our test server and it has performed perfectly.


I noticed a problem with 7.0 on a couple of Dell servers.  [...] We've 
since then compiled the kernel under the BSD scheduler to rule that out, 
and so far so good.


Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?


FWIW, the other guy I know who is having this problem had already switched 
to using ULE under 7.0-release, and did not have any problems with it.  So 
*his* problem was probably not related to SCHED_ULE, unless something has 
recently changed there.


Turns out he hasn't reverted back to 7.0-release just yet, so he's going to 
try SCHED_4BSD and see if that helps his situation.


Scheduler changes always come with some risk of exposing bugs that have 
existed in the code for a long time but never really manifested themselves. 
ULE is well shaken-out, having been under development for at least five years, 
but it is possible that some problems will become visible as a result of the 
switch.  I would encourage people to stick with ULE, but if you're having a 
stability problem then experimenting with scheduler as a variable that could 
be triggering the problem may well be useful to help track down the bug.  Most 
of the time the bugs will not be in ULE itself, rather, triggered because ULE 
will change the ordering or balancing of work in the system, so we should try 
to avoid situations where people switch to 4BSD from ULE and stick with it 
rather than getting the underlying problem fixed!


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Robert Watson

On Sat, 10 Jan 2009, Pete French wrote:

FWIW, the other guy I know who is having this problem had already switched 
to using ULE under 7.0-release, and did not have any problems with it.  So 
*his* problem was probably not related to SCHED_ULE, unless something has 
recently changed there.


Well, one of my machines just locked up again, even with SCHED_4BSD on it, 
so I am now thinking it is unrelated.


The machine has completely locked - no response to pings, no response to 
keypresses, nor to the power button. There is nothing printed on the console 
- it is just sitting there with a login prompt :-(


This is really not good - these are extremely common servers after all, and 
I am just running bog standard 7.1 with apache and mysql. This is happening 
across several different servers, all of which are slight variants on the 
DL360, so I don't think it is something peculiar to me.


I'm not sure if you've done this already, but the normal suggestions apply: 
have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
any results / panics / etc result?  Sometimes these debugging tools are able 
to convert hangs into panics, which gives us much more ability to debug them. 
If it still hangs rather than panicking, are you able to break into the 
debugger on the console?  If you're using a video console and not able to get 
to the debugger, would it be possible to configure a serial console and use 
that -- serial breaks are often more successful at getting to the debugger 
than keyboard breaks.  Likewise, I'm not sure if this hardware has an NMI 
button -- some HP servers have one on the motherboard that you can press -- 
but that is also potentially a way to get into the debugger to analyze the 
crash.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Pete French
 I'm not sure if you've done this already, but the normal suggestions apply: 
 have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
 any results / panics / etc result?  Sometimes these debugging tools are able 
 to convert hangs into panics, which gives us much more ability to debug them. 

I did, but it turns out I had an incorrect option in there which made the
data I got not relevant. I now have another machine running a kernel
with the following config:

include GENERIC
ident   DEBUG

options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options MUTEX_DEBUG
options WITNESS
options LOCK_PROFILING
options INVARIANTS
options INVARIANT_SUPPORT
options DIAGNOSTIC

Those should enable me to get some useful output I hope.
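
As a quick cross-check of a config like this against the options suggested 
earlier in the thread (INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER), a small 
sketch along these lines could be used. It is only an illustration with 
made-up function names: it reads the "options" lines of the text it is given 
and does not follow "include GENERIC", so anything GENERIC itself provides is 
not seen.

```python
# Options suggested earlier in the thread for turning hangs into panics.
SUGGESTED = ["INVARIANTS", "WITNESS", "DDB", "KDB", "BREAK_TO_DEBUGGER"]


def configured_options(config_text):
    """Collect option names from `options NAME` lines, ignoring # comments."""
    found = set()
    for line in config_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "options":
            # Drop any "=value" suffix, e.g. "options FOO=1"
            found.add(fields[1].split("=", 1)[0])
    return found


def missing_options(config_text, wanted=SUGGESTED):
    """Return the wanted options that the config text does not set itself."""
    present = configured_options(config_text)
    return [name for name in wanted if name not in present]
```

Fed the config above, missing_options() lists only BREAK_TO_DEBUGGER as not 
set explicitly.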

 If it still hangs rather than panicking, are you able to break into the 
 debugger on the console?  If you're using a video console and not able to get 
 to the debugger, would it be possible to configure a serial console and use 

I can't add a serial console - I am remote enough from most of
these machines (Slough) and very remote from the test box (it's in the USA!)
so I can't get to them physically. But I do have iLO, which lets me use the
console and gives me a bit of access to the front. I will check for NMI.

Just had another lockup here - my working day has become a succession of
running round rebooting servers through iLO at the moment.

Will get back to you when the debug one has crashed - I could possibly
give you direct access to the iLo console on that if you need it ?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Garance A Drosihn

At 2:55 PM + 1/12/09, Robert Watson wrote:

On Fri, 9 Jan 2009, Garance A Drosihn wrote:


At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:

On Jan 8, 2009, at 8:58 PM, Pete French wrote:
I have a number of HP 1U servers, all of which were running 7.0 
perfectly happily. I have been testing 7.1 in its various 
incarnations for the last couple of months on our test server and 
it has performed perfectly.


I noticed a problem with 7.0 on a couple of Dell servers.  [...] 
We've since then compiled the kernel under the BSD scheduler to 
rule that out, and so far so good.


Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?


FWIW, the other guy I know who is having this problem had already 
switched to using ULE under 7.0-release, and did not have any 
problems with it.  So *his* problem was probably not related to 
SCHED_ULE, unless something has recently changed there.


Turns out he hasn't reverted back to 7.0-release just yet, so he's 
going to try SCHED_4BSD and see if that helps his situation.


Scheduler changes always come with some risk of exposing bugs that 
have existed in the code for a long time but never really manifested 
themselves. ULE is well shaken-out, having been under development 
for at least five years, but it is possible that some problems will 
become visible as a result of the switch.  I would encourage people 
to stick with ULE, but if you're having a stability problem then 
experimenting with scheduler as a variable that could be triggering 
the problem may well be useful to help track down the bug.


Just to follow up on this:  My friend did switch back to a 7.1 kernel with
SCHED_4BSD, and he still ran into problems.  The error messages weren't
the same, but errors did happen in the same high disk-I/O situations as
the lockup happened with SCHED_ULE.  At this point he's fallen back to
the 7.0-kernel that he had been running (which also has SCHED_ULE), and
all the problems have gone away.  So at the moment he's running with a
7.0-ish kernel and the 7.1-release userland, without the hanging problems.
So the problem is something in the kernel, but it is *NOT* the scheduler
(at least, not in his case).

He is not eager to do a whole lot of experiments to track down the
problem, since this is happening on busy production machines and he
can't afford to have a lot of downtime on them (especially now that the
semester at RPI has started up).  The systems have some large (2 TB)
filesystems on them, and the lockups occur in high disk-I/O situations.
He's seeing the problem on one system which is a dual CPU quad-core
xeon, and another which is a 64 bit P4 with hyperthreading.  The one
thing in common between the two setups is that the boot drives + a
3ware controller (with its array of RAID disks) is moved from one
machine to the other one:

  It's a 3ware 9500 12 port model, the boot drive is connected to
   an ICH6 in IDE mode, and yes, I've run it in single, single with
   hyper threading, and 8 way mode.  All 64 bit.

We still have no idea where the problem really is.  For all we know,
someone spilled a Pepsi on it when he wasn't looking...

--
Garance Alistair Drosehn            =   g...@gilead.netel.rpi.edu
Senior Systems Programmer           or  g...@freebsd.org
Rensselaer Polytechnic Institute    or  dro...@rpi.edu


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Pete French
 I'm not sure if you've done this already, but the normal suggestions apply: 
 have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
 any results / panics / etc result?  Sometimes these debugging tools are able 
 to convert hangs into panics, which gives us much more ability to debug them. 

OK, I have now had a machine hang again, with the correct debug options in
the kernel. The screen looked like this when I went to restart it:

http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear
to be console messages after the lock order reversal - is that normal?

The machine did stay up for a significant amount of time before doing this. I
notice that it is more or less identical to the one I posted when I
had WITNESS_KDB in the kernel too, so maybe those results aren't
entirely spurious after all?

Given it hasn't dropped to a debugger, is there anything else I can try?

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Pete French
 Just to followup on this:  My friend did switch back to a 7.1 kernel with
 SCHED_4BSD, and he still ran into problems.  The error messages weren't

Actually, I don't know if I posted it, but that was the same for me too.
The scheduler makes no difference, nor do CPU compile settings.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Tomas Randa

Hello,

I have similar problems. The last good kernel I have is from the stable 
branch, October 8. After the next upgrade, I saw big problems with 
performance.

I tried ULE, 4BSD, etc., but nothing helps except downgrading the system.

Now I am trying 7.1-p1 and the problems are here again. MySQL spends a 
lot of time in the states "waiting for opening table" or "waiting for 
close tables".


I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and 
an Areca SATA controller. Could the problem be in the da device, for example?


Thanks Tomas Randa




Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Claus Guttesen
 I have similar problems. The last good kernel I have is from the stable branch,
 October 8. After the next upgrade, I saw big problems with performance.
 I tried ULE, 4BSD, etc., but nothing helps except downgrading the system.

 Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot
 of time in the states "waiting for opening table" or "waiting for close
 tables".

 I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and an
 Areca SATA controller. Could the problem be in the da device, for example?

It was mentioned previously in this thread that CPUTYPE could be an
issue. Did you change this if you customized your kernel?

-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Robert Watson

On Mon, 12 Jan 2009, Tomas Randa wrote:

I have similar problems. The last good kernel I have is from the stable branch, 
October 8. After the next upgrade, I saw big problems with performance. I 
tried ULE, 4BSD, etc., but nothing helps except downgrading the system.


Now I am trying 7.1-p1 and the problems are here again. MySQL spends a lot 
of time in the states "waiting for opening table" or "waiting for close 
tables".


I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard, and an 
Areca SATA controller. Could the problem be in the da device, for example?


So far, this sounds like a different problem than the one others have been 
posting about, which involves full system freezes rather than specific 
processes wedging or responding poorly.  I'd suggest starting by using 
procstat -k on the process ID to look at where specific threads are waiting 
in the kernel.  Is it simply that MySQL is being unreasonably slow in certain 
situations, or does it actually entirely stop operating?


If you're able to narrow down the date on the 7.x branch where the problem 
you're experiencing begins, that would be most helpful.  I'd suggest leaving 
your userspace on the 8th of October, and sliding the kernel forward in a binary 
search until you've narrowed it down a bit.  Obviously, this takes a bit of 
patience, but narrowing it down could be quite informative.
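
The binary search over kernel dates can be sketched as below. The function 
name and the is_bad callback are made up for illustration; in practice 
"is_bad" is the manual step of building and booting a kernel from that date 
and seeing whether the problem shows up, and the sketch assumes that once the 
problem appears it stays in every later kernel.

```python
from datetime import date, timedelta


def bisect_first_bad(good, bad, is_bad):
    """Return the earliest date in (good, bad] for which is_bad(date) is true.

    Assumes `good` is known good, `bad` is known bad, and that the problem
    never disappears again once it has been introduced.
    """
    while (bad - good).days > 1:
        # Test the kernel from the middle of the remaining window.
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid   # problem already present: look earlier
        else:
            good = mid  # still fine: look later
    return bad
```

Each round halves the window, so roughly three months of commits need about 
seven kernel builds to get down to a single day.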


Robert N M Watson
Computer Laboratory
University of Cambridge









Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Robert Watson


On Mon, 12 Jan 2009, Pete French wrote:

I'm not sure if you've done this already, but the normal suggestions apply: 
have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do 
any results / panics / etc result?  Sometimes these debugging tools are 
able to convert hangs into panics, which gives us much more ability to 
debug them.


OK, I have now had a machine hang again, with the correct debug options in 
the kernel. The screen looked like this when I went to restart it:


http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear to 
be console messages after the lock order reversal - is that normal?


Lock order reversals are warnings of potential deadlock due to a lock cycle, 
but deadlocks may not actually result, either because it's a false positive 
(some locking construct that is deadlock free but involves lock cycles), or 
because a cycle didn't actually form.  The message is suggestive, but if you 
have significant system activity after the message, then it may be unrelated.
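
To illustrate why a reversal is a warning rather than proof of a deadlock, 
here is a toy order tracker. It is not how WITNESS is implemented and the 
class and method names are invented; it only records which lock was held when 
another was taken and complains when a new acquisition contradicts an order 
seen earlier.

```python
class LockOrderChecker:
    """Toy lock-order tracker: remembers which lock was held when another was taken."""

    def __init__(self):
        self.after = {}      # lock -> set of locks acquired while it was held
        self.reversals = []  # (held, acquired) pairs that contradict earlier order

    def _reachable(self, start, target):
        """True if `target` can be reached from `start` along recorded orderings."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.after.get(node, ()))
        return False

    def acquire(self, held, new):
        """Record that `new` is taken while each lock in `held` is already held."""
        for lock in held:
            if lock == new:
                continue  # recursive acquisition is not an ordering question
            # If new -> ... -> lock was seen before, lock -> new closes a cycle.
            if self._reachable(new, lock):
                self.reversals.append((lock, new))
            self.after.setdefault(lock, set()).add(new)
```

Taking A then B, B then C, and later C then A gets the last step flagged even 
if the three acquisitions happened at different times in a single thread, 
where nothing could actually deadlock.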


The machine did stay up for a significant amount of time before doing this. I 
notice that it is more or less identical to the one I posted when I had 
WITNESS_KDB in the kernel too, so maybe those results aren't entirely 
spurious after all?


Given it hasn't dropped to a debugger, is there anything else I can try?


Features like WITNESS and INVARIANTS may change the timing of the kernel 
making certain race conditions less likely; I'd run with them for a bit and 
see if you can reproduce the hang with them present, as they will make 
debugging the problem a lot easier, if it's possible.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-12 Thread Robert Watson

On Mon, 12 Jan 2009, Garance A Drosihn wrote:

He is not eager to do a whole lot of experiments to track down the problem, 
since this is happening on busy production machines and he can't afford to 
have a lot of downtime on them (especially now that the semester at RPI has 
started up).  The systems have some large (2 TB) filesystems on them, and 
the lockups occur in high disk-I/O situations. He's seeing the problem on 
one system which is a dual CPU quad-core xeon, and another which is a 64 bit 
P4 with hyperthreading.  The one thing in common between the two setups is 
that the boot drives + a 3ware controller (with its array of RAID disks) is 
moved from one machine to the other one:


I think playing the combinatorics game on compile-time flags, kernel features, 
etc, is probably not the best way to go about debugging this.  Instead, I'd 
debug this as a kernel hang by breaking into the debugger once it occurs, if 
possible, and ideally on a serial console.  Often times hangs can be debugged 
looking solely at DDB output, or if possible, a crash dump.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-01-11 Thread Pete French
 I noticed a similar problem testing 7.1-RC1, It seemed to be a deep
 deadlock, as it was triggered by lighttpd doing kern_sendfile, and
 never returning. The side effects (being unable to create processes,
 etc.) are similar.

Interesting - did you get any responses from anyone else regarding
this? My last box which locked up was essentially idle, so I am very
surprised by all of this - also none of the heavily loaded machines
(i.e. the actual webservers) have locked up.

I am also surprised that this isn't more widely reported, as
the hardware is very common. The only oddity with my compile
is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
I will remove it anyway, just so I am actually building a completely
vanilla amd64. That way I should have what everyone else has, and since
I don't see anyone else saying they have issues then maybe mine will
go away too (fingers crossed)

 My kernconf is below, try building the kernel, and send an email
 containing the backtrace from any process that has blocked (in my

OK, will do. I can try this on the one non-essential box which
locked up yesterday. I don't know how long it will be before it
locks up again, but will see if I can do some things to provoke it.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-11 Thread Pete French
 My kernconf is below, try building the kernel, and send an email
 containing the backtrace from any process that has blocked (in my

Well, I haven't managed to get a backtrace, but immediately upon
booting the system halts with the following:

http://www.twisted.org.uk/~pete/71_lor1.jpg

Interestingly, if I try and boot into safe mode then it will not
even get that far:

http://www.twisted.org.uk/~pete/71_safe1.jpg

Am going to try and backtrace that now to see what I can get. Unfortunately
I can only provide screen captures rather than actual text output from
this due to having to go via a Mac running RDP through an ssh tunnel
to a Windows box and then using IE to go to the iLO :-) Convoluted,
but it works...

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-11 Thread Dylan Cochran
On Sun, Jan 11, 2009 at 11:27 AM, Pete French
petefre...@ticketswitch.com wrote:
 My kernconf is below, try building the kernel, and send an email
 containing the backtrace from any process that has blocked (in my

 Well, I haven't managed to get a backtrace, but immediately upon
 booting the system halts with the following:

http://www.twisted.org.uk/~pete/71_lor1.jpg

Not Found


Re: Big problems with 7.1 locking up :-(

2009-01-11 Thread Pete French
 Not Found

Sorry, see the subsequent email; there are more links there to working PNGs.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-11 Thread Garrett Cooper
On Sun, Jan 11, 2009 at 4:45 AM, Pete French
petefre...@ticketswitch.com wrote:
 I noticed a similar problem testing 7.1-RC1. It seemed to be a deep
 deadlock, as it was triggered by lighttpd doing kern_sendfile, and
 never returning. The side effects (being unable to create processes,
 etc.) are similar.

 Interesting - did you get any responses from anyone else regarding
 this?  My last box which locked up was essentially idle, so I am very
 surprised by all of this - also none of the heavily loaded machines
 (i.e. the actual webservers) have locked up.

 I am also surprised that this isn't more widely reported, as
 the hardware is very common. The only oddity with my compile
 is that I set the CPUTYPE to 'core2' - that shouldn't have an effect, but
 I will remove it anyway, just so I am actually building a completely
 vanilla amd64. That way I should have what everyone else has, and since
 I don't see anyone else saying they have issues then maybe mine will
 go away too (fingers crossed).

 My kernconf is below, try building the kernel, and send an email
 containing the backtrace from any process that has blocked (in my

 OK, will do. I can try this on the one non-essential box which
 locked up yesterday. I don't know how long it will be before it
 locks up again, but will see if I can do some things to provoke it.

 -pete.

Intel suggests nocona for x86_64 platforms and prescott for x86
(i386) based platforms on the GCC 4.2 line, because they best match the
cache size and feature set of the Core2 processors.

I don't think that core2 support was fully completed in GCC 4.2 (in
fact I believe it was just started), and I don't think that our
binutils supports it properly.

Some thoughts,
-Garrett


Re: Big problems with 7.1 locking up :-(

2009-01-10 Thread Pete French
 FWIW, the other guy I know who is having this problem had already
 switched to using ULE under 7.0-release, and did not have any
 problems with it.  So *his* problem was probably not related to
 SCHED_ULE, unless something has recently changed there.

Well, one of my machines just locked up again, even with SCHED_4BSD
on it, so I am now thinking it is unrelated.

The machine has completely locked - no response to pings, no
response to keypresses, nor to the power button. There is nothing
printed on the console - it is just sitting there with a login prompt :-(

This is really not good - these are extremely common servers after all, and
I am just running bog standard 7.1 with apache and mysql. This is happening
across several different servers, all of which are slight variants on
the DL360, so I don't think it is something peculiar to me.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-10 Thread heliocentric
I noticed a similar problem testing 7.1-RC1. It seemed to be a deep
deadlock, as it was triggered by lighttpd doing kern_sendfile, and
never returning. The side effects (being unable to create processes,
etc.) are similar.

My kernconf is below, try building the kernel, and send an email
containing the backtrace from any process that has blocked (in my
case, lighttpd attempting to sendfile a large amount of data to php
fastcgi triggered it, but that's a guess on my part). Note that this
includes WITNESS and INVARIANTS, so performance will be hit. Also,
enable watchdogd, and add -e 'ls -al /etc' to its flags. It should
drop you to a debugger with a backtrace within a few seconds of the
lock being triggered, and it should output a backtrace and any
invariant/witness lock warnings. Obviously if you don't have a serial
or local console, don't do this.

include GENERIC
ident   DEBUG
options KDB
options DDB
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options INVARIANTS
options WITNESS

On 1/10/09, Pete French petefre...@ticketswitch.com wrote:
 FWIW, the other guy I know who is having this problem had already
 switched to using ULE under 7.0-release, and did not have any
 problems with it.  So *his* problem was probably not related to
 SCHED_ULE, unless something has recently changed there.

 Well, one of my machines just locked up again, even with SCHED_4BSD
 on it, so I am now thinking it is unrelated.

 The machine has completely locked - no response to pings, no
 response to keypresses, nor to the power button. There is nothing
 printed on the console - it is just sitting there with a login prompt :-(

 This is really not good - these are extremely common servers after all, and
 I am just running bog standard 7.1 with apache and mysql. This is happening
 across several different servers, all of which are slight variants on
 the DL360, so I don't think it is something peculiar to me.

 -pete.



Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Guy Helmer

Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.

So the last two days I have been round upgrading all our servers, knowing
that I had run the system stably on identical hardware for some time.

Since then I have started seeing machines lock up. This always happens under
heavy disc load. When I bring the machine back up then sometimes it fails
to fsck due to a partially truncated inode. The lockups appear to
be disc related - on my mysql master machine it will come back up with
files somewhat shorter than those which have already been transmitted to
the slave (i.e. some data was in memory, and claimed to have been written
to the drive, but never made it onto the disc).

The only time I have seen anything useful on the screen was during one lockup
where I got a message about a spin lock being held too long and some
comment in parentheses about it being a turnstile lock.

Help! :-(

I am now downgrading all the machines to 7.0 as fast as I can - though the
machine I am trying to compile it on has locked up once during the compile
so I haven't got anywhere so far.

The machines are HP Proliant DL360 G5s - they have an embedded P400i
RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.

  
I can't tell whether my situation is related, but I am seeing lockups on 
SMP Supermicro servers with both older (NetBurst-ish) and current Xeon 
CPUs.  I have been dropping into the kernel debugger and getting lock 
information and process backtraces, but so far nothing has been 
conclusively identified.  I think the issue I'm seeing was introduced 
sometime between October 2 and November 24 in the RELENG_7 branch, and I 
suppose the next step is to do a binary search for the offending change.
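
The binary search described above can be sketched roughly as follows. This
is only an illustration in Python, not something anyone in this thread ran.
It assumes the two dates fall in 2008 (the thread is from January 2009),
that sources as of the earlier date are good, that sources as of the later
date hang, and that every date after the offending change is bad. The
function name first_bad_date and the is_bad callback are hypothetical; in
practice is_bad would mean checking out RELENG_7 as of that date, building
and booting it, and waiting to see whether the machine hangs.

```python
from datetime import date, timedelta


def first_bad_date(good, bad, is_bad):
    """Return the earliest date in (good, bad] for which is_bad(date) is true.

    Assumes `good` is known good, `bad` is known bad, and that the
    behaviour never flips back to good once it has gone bad.
    """
    while (bad - good).days > 1:
        # Test the midpoint of the remaining window.
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid    # offending change is at or before mid
        else:
            good = mid   # offending change is after mid
    return bad


if __name__ == "__main__":
    # Hypothetical stand-in for "build, boot and wait for a hang".
    # The 20 October cut-off is invented purely for this demonstration.
    def hangs_when_built_from(day):
        return day >= date(2008, 10, 20)

    culprit = first_bad_date(date(2008, 10, 2), date(2008, 11, 24),
                             hangs_when_built_from)
    print("first bad date:", culprit)
```

With 53 days between the two endpoints, this needs at most six
build-and-test cycles rather than one per day.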


Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.



Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Mike Tancsa

At 09:49 AM 1/9/2009, Guy Helmer wrote:


RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.


I can't tell whether my situation is related, but I am seeing 
lockups on SMP Supermicro servers with both older (NetBurst-ish) and 
current Xeon CPUs.  I have been dropping into the kernel debugger 
and getting lock information and process backtraces, but so far 
nothing has been conclusively identified.  I think the issue I'm 
seeing was introduced sometime between October 2 and November 24 in 
the RELENG_7 branch, and I suppose the next step is to do a binary 
search for the offending change.


Are you using the same disk controller as Pete?  Do both of you run
with quotas on the file system?  By lockup, do you mean it doesn't
respond to the network either or just anything that needs disk IO?


---Mike 




Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Pete French
 Are you using the same disk controller as Pete?  Do both of you run
 with quotas on the file system?  By lockup, do you mean it doesn't
 respond to the network either or just anything that needs disk IO?

I don't think he can be using the same controller, as mine is an
embedded HP unit. They do make a separate plugin one though - P400
SAS controller.

My symptoms are that the thing locks hard and responds to nothing, no
keypresses or anything. I am assuming that the disc is the first thing to
go though, because I see data which was being written to a file and a
process reading from that file to the network: more of the file comes
over the network than makes it physically onto the disc.

The only useful error I ever saw was the message about spin
lock / turnstile locks being held for too long.

-pete.


Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Guy Helmer

Pete French wrote:
Are you using the same disk controller as Pete?  Do both of you run
with quotas on the file system?  By lockup, do you mean it doesn't
respond to the network either or just anything that needs disk IO?



I don't think he can be using the same controller, as mine is an
embedded HP unit. They do make a separate plugin one though - P400
SAS controller.

My symptoms are that the thing locks hard and responds to nothing, no
keypresses or anything. I am assuming that the disc is the first thing to
go though, because I see data which was being written to a file and a
process reading from that file to the network: more of the file comes
over the network than makes it physically onto the disc.

The only useful error I ever saw was the message about spin
lock / turnstile locks being held for too long.

-pete.
  
OK, perhaps my issue is different then.  My symptoms seem to be a hang 
from anything that triggers a fork(), such as entering a command at a 
shell prompt or entering a user name at the console's login prompt.  
Network activity still works -- all the TCP connections stay up until I 
drop into the kernel debugger or power cycle.


Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.



Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Robert Blayzor

On Jan 8, 2009, at 8:58 PM, Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.



I noticed a problem with 7.0 on a couple of Dell servers.  Not sure if  
this is related but when our system froze the box was pingable, and  
you could switch virtual consoles... however, you could not type  
anything on the screen or connect to any sockets.  Num-lock would  
still work so the box wasn't solidly frozen.  This used to happen a  
couple of times every week or two.  We've since then compiled the  
kernel under the BSD scheduler to rule that out, and so far so good.   
(our box was a Dell PE1750, 2GB of RAM, amr RAID controller, bge  
network driver)  The primary application was just ntpd and apache with  
mpm_worker  threads.


Since ULE is now the default scheduler in 7.1 and was not in 7.0, perhaps
you can try switching back to the BSD scheduler?


--
Robert Blayzor, BOFH
INOC, LLC
rblay...@inoc.net
http://www.inoc.net/~rblayzor/





Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Pete French
 Since ULE is now the default scheduler in 7.1 and was not in 7.0, perhaps
 you can try switching back to the BSD scheduler?

Actually you might be on to something there: one of the main differences
between our test DL360 and the live ones is that the test one has fewer
cores in it, and is under less load. So multiprocessing problems may well
show up on the live boxes where they won't on the test box. I shall try
building a kernel with the BSD scheduler and see what happens there.
Probably not today, as I am loath to cause any more downtime right now.

thanks,

-pete.



Re: Big problems with 7.1 locking up :-(

2009-01-09 Thread Garance A Drosihn

At 1:58 AM + 1/9/09, Pete French wrote:

I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.

So the last two days I have been round upgrading all our servers, knowing
that I had run the system stably on identical hardware for some time.

Since then I have started seeing machines lock up. This always happens
under heavy disc load. When I bring the machine back up then sometimes
it fails to fsck due to a partially truncated inode. The lockups appear
to be disc related  [...]


One of my friends is also having trouble with lockups on two machines
he had upgraded to 7.1.  Also seems to be related to heavy disk I/O,
although I'm not sure the symptoms are the same as what you report.
Both machines had been running 7.0-release without trouble.  On at
least one of the systems, he's also working with (what I consider)
very large file systems (over 2 TB).  Both machines are using a 3ware
controller with its RAID.

I realize that isn't much to go on, but it suggests that there is
some problem wider than just your (Pete's) usage.  I think his
situation is such that lockups like this are simply not acceptable,
and the last I heard he was reverting back to 7.0-release.

--
Garance Alistair Drosehn            =   g...@gilead.netel.rpi.edu
Senior Systems Programmer           or  g...@freebsd.org
Rensselaer Polytechnic Institute    or  dro...@rpi.edu

