Re: NFS Locking Issue

2006-07-03 Thread Robert Watson


On Mon, 3 Jul 2006, Kostik Belousov wrote:


On Mon, Jul 03, 2006 at 12:50:11AM -0400, Francisco Reyes wrote:

Kostik Belousov writes:

Since nobody except you experiences these problems (at least, only you
reported that the problem exists)


Did you miss the part of:


User Freebsd writes:

Since there are several of us experiencing what looks to be the same sort
of deadlock issue, I beseech you not to give up


I am not the only one reporting or having the issue.

I think you have different issues.


I agree.  It looks like we have several issues floating around.  There are 
some known issues with rpc.lockd (and probably some unknown ones) that will 
require a concerted effort to resolve.  There appear to be a number of reports 
relating to this/these problems.


It sounds like there is also an NFS client race condition or other bug of some 
sort.


I think it would be really useful to isolate the two during debugging. 
Specifically, to make sure that the second client bug is reproducible without 
rpc.lockd running on the client (and related mount flags).  Once we have some 
more information, such as vnode locking information, client thread stack 
traces, etc., we should probably get Mohan in the loop if things seem sticky. 
I believe he was on vacation last week; he may be back this week sometime. 
With the July 4 weekend afoot, a lot of .us developers are offline.


Robert N M Watson
Computer Laboratory
University of Cambridge
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: FreeBSD 6.1 Tor issues (Once More, with Feeling)

2006-07-03 Thread Fabian Keil
Dan Nelson [EMAIL PROTECTED] wrote:

 In the last episode (Jul 02), Robert Watson said:
  On Sun, 2 Jul 2006, Fabian Keil wrote:
  The ssh man page offers:
  
  |~B  Send a BREAK to the remote system (only useful for SSH
  |protocol version 2 and if the peer supports it).
  
  I am using ssh 2, but the only reaction I get is a new line.
  
  |FreeBSD/i386 (tor.fabiankeil.de) (ttyd0)
  |
  |login: ~B
 
 If you enter ~B and actually see a ~B printed to the screen, then ssh
 didn't process it because you didn't hit <cr> first.  So <cr>~B will
 tell ssh to send a break.

I am actually using <cr>~B and I don't see just ~B,
but ~B. The tilde is printed after I release B, therefore I
guess it is working.
 
  It sounds like your serial console server may not know how to map
  SSH break signals into remote serial break signals.  Try
  ALT_BREAK_TO_DEBUGGER.  Here's the description from NOTES:
  
  # Solaris implements a new BREAK which is initiated by a character
  # sequence CR ~ ^b which is similar to a familiar pattern used on
  # Sun servers by the Remote Console.
  options ALT_BREAK_TO_DEBUGGER
 
 ... and if you're sshing to your terminal server, remember that ssh
 will eat that tilde (because you sent <cr>~ ), so you need to send
 <cr>~~^B to pass the right characters to FreeBSD.  Or change ssh's
 escape character with the -e flag.

<cr>~^b works for me, without touching any ssh settings.
As <cr>~. is still causing a disconnect, it doesn't look
like the escape character was changed either.
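
For reference, the alternate break sequence quoted from NOTES can be written out as raw bytes. This is only a sketch, assuming a POSIX shell with printf and od; the helper names alt_break_bytes and alt_break_hex are made up for illustration, and the host name in the comment is a placeholder.

```shell
#!/bin/sh
# Emit the three bytes of the alternate break sequence: CR, '~', Ctrl-B.
# (alt_break_bytes is a hypothetical helper name, not part of any tool.)
alt_break_bytes() {
    printf '\r~\002'
}

# Show the same bytes in hex, so it is clear what has to reach the
# serial line on the FreeBSD side.
alt_break_hex() {
    alt_break_bytes | od -An -tx1 | tr -d ' \n'
    echo
}

alt_break_hex    # prints 0d7e02

# ssh(1) only treats its escape character as special right after a
# newline, and "~~" sends one literal tilde. To stop ssh from looking at
# '~' at all, the escape character can be moved or disabled with -e:
#   ssh -e none user@termserver.example.org
```

With the escape character disabled, whatever is typed after <cr> goes to the terminal server unchanged, at the cost of losing <cr>~. as a way to drop a stuck session.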

Fabian
-- 
http://www.fabiankeil.de/




Re: FreeBSD 6.1 Tor issues (Once More, with Feeling)

2006-07-03 Thread Fabian Keil
Fabian Keil [EMAIL PROTECTED] wrote:

 Robert Watson [EMAIL PROTECTED] wrote:

  It sounds like your serial console server may not know how to map
  SSH break signals into remote serial break signals.  Try
  ALT_BREAK_TO_DEBUGGER.  Here's the description from NOTES:
  
  # Solaris implements a new BREAK which is initiated by a character
  # sequence CR ~ ^b which is similar to a familiar pattern used on
  # Sun servers by the Remote Console.
  options ALT_BREAK_TO_DEBUGGER
 
 It took me several attempts to get the character sequence right,
 but yes, this one works. Thanks.

Unfortunately it didn't work while the system was hanging
this morning. I wasn't logged in at the console before the
hang occurred, so it may be that the terminal server checked
the console for signs of life, found none, and neither
connected nor printed a warning (a wild guess; I have no idea
whether it does that).

It could also mean that I'm seeing the mysterious power off part
described in: http://www.freebsd.org/cgi/query-pr.cgi?pr=95180
but I have no way to tell the difference.

I will stay connected to the console until the system hangs
again to see if it changes anything.

Fabian
-- 
http://www.fabiankeil.de/




Re: NFS Locking Issue

2006-07-03 Thread Kostik Belousov
On Mon, Jul 03, 2006 at 10:06:52AM +0100, Robert Watson wrote:
 
 On Mon, 3 Jul 2006, Kostik Belousov wrote:
 
 On Mon, Jul 03, 2006 at 12:50:11AM -0400, Francisco Reyes wrote:
 Kostik Belousov writes:
 Since nobody except you experiences these problems (at least, only you
 reported that the problem exists)
 
 Did you miss the part of:
 
 User Freebsd writes:
 Since there are several of us experiencing what looks to be the same 
 sort
 of deadlock issue, I beseech you not to give up
 
 I am not the only one reporting or having the issue.
 I think you have different issues.
 
 I agree.  It looks like we have several issues floating around.  There are 
 some known issues with rpc.lockd (and probably some unknown ones) that will 
 require a concerted effort to resolve.  There appear to be a number of 
 reports relating to this/these problems.
 
 It sounds like there is also an NFS client race condition or other bug of 
 some sort.
 
 I think it would be really useful to isolate the two during debugging. 
 Specifically, to make sure that the second client bug is reproducible 
 without rpc.lockd running on the client (and related mount flags).  Once we 
 have some more information, such as vnode locking information, client 
 thread stack traces, etc., we should probably get Mohan in the loop if 
 things seem sticky. I believe he was on vacation last week; he may be back 
 this week sometime. With the July 4 weekend afoot, a lot of .us developers 
 are offline.

I too noted some time ago that an unresponsive NFS server takes the
NFS client down. I looked at the issue then, and had the impression
that this is again a case of runningbufspace depletion. I saw a lot
of processes in the wdrain and flswai states. After the NFS server was
repaired, the active write requests were executed, the number of dirty
buffers decreased, and the system returned to normal operation.
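
A rough way to spot this condition on a client is to filter the process list by wait channel. The sketch below assumes ps output with the columns pid, state, wchan, command in that order; stuck_writers is a hypothetical helper name, and the sysctl names in the comment are given from memory and should be checked on the release in question.

```shell
#!/bin/sh
# Keep only the rows whose wait channel (third column) is wdrain or
# flswai, the two states mentioned above. The header row is kept so the
# columns stay labelled.
# (stuck_writers is a hypothetical helper name.)
stuck_writers() {
    awk 'NR == 1 || $3 == "wdrain" || $3 == "flswai"'
}

# Typical use on the NFS client while the server is unresponsive:
#   ps -axo pid,state,wchan,command | stuck_writers
#
# The buffer accounting can be watched alongside it (names assumed):
#   sysctl vfs.runningbufspace vfs.hirunningspace vfs.numdirtybuffers
```

If the list of matching processes keeps growing while the byte and buffer counters stay pinned near their limits, that would be consistent with the depletion described above, though it does not prove it.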

This seems to be an architectural issue. I tried to bring the discussion up
several months ago, but got no response.

And there is the small problem of SIGINT being ignored when mounted
with the intr flag. A patch to fix this was attached to my previous mail.





Re: NFS Locking Issue

2006-07-03 Thread Andrew Reilly
On Mon, Jul 03, 2006 at 10:06:52AM +0100, Robert Watson wrote:
 It sounds like there is also an NFS client race condition or other bug of 
 some sort.

It may not be related, directly, but one thing that I noticed,
while trying to sort out my own recently commissioned NFS setup,
is that the -r1024 mount flag is *crucial* when the network is
100BaseT and the server is a new, fast amd64 box, and the client
is an old P3-500 with a RealTek ethernet card.  It works fine,
now, but tcpdump showed that it was retrying forever without.
Even NFS over TCP seemed to suffer a bunch of error-related
retries which amounted to stalls in the client.
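
For anyone wanting to try the same setting, the -r flag of mount_nfs(8) sets the read data size. The sketch below only prints the command line so it can be reviewed before being run as root; nfs_mount_cmd is a hypothetical helper, and the server name and paths are placeholders.

```shell
#!/bin/sh
# Print (rather than run) a mount_nfs command line with a reduced read
# size. Nothing is mounted by this script.
# (nfs_mount_cmd is a hypothetical helper; host and paths are placeholders.)
nfs_mount_cmd() {
    rsize=$1
    export_path=$2
    mount_point=$3
    echo "mount_nfs -r $rsize $export_path $mount_point"
}

nfs_mount_cmd 1024 server.example.org:/export /mnt
```

Comparing tcpdump output with and without the reduced read size, as described above, is what shows whether the retries actually go away.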

Is there any way for this sort of thing to be adjusted
automatically?

Cheers,

-- 
Andrew


Re: NFS Locking Issue

2006-07-03 Thread Michel Talon
 So it would appear that you cured the NFS problems inherent with FBSD-6
 by replacing FBSD with Fedora Linux. Nice to know that NFSd works in Linux.
 But won't help those on the FBSD list fix their FBSD-6 boxen. :/


First, NFS is designed to make machines running different OSs interact
properly. If a FreeBSD server interacts properly with a FreeBSD client, but
not with other clients, you cannot say that the situation is fine.
Second, I am not the one who chooses the NFS server; people work
in social groups in the real world.

And third, most importantly, the OP's message seemed to imply that the
FreeBSD-6 NFS client was at fault. I pointed out that in my experience my
FreeBSD-6.1 client works OK, while the 6.0 one doesn't, when interacting with
an FC5 server. This is in itself a relevant piece of information for the
problem at hand. It may be that the server side is at fault, or that there is
some complex interaction between client and server.

Anyway, some people claimed here that they had no problems with FreeBSD-5
clients and servers. My experience is that I had constant problems
between FreeBSD-5 clients and Fedora Core 3 servers. I cannot provide any
other data point. I am not particularly sure of the quality of the FC3 or
FC5 NFS server implementation, except that the ~ 100 workstations
running the similar Fedora distribution work like a charm with their home
directories NFS-mounted from the server. On the other hand, a Debian client
machine also has severe NFS problems. My only conclusion is that these NFS
stories are very tricky. The only moment everything worked fine was when we
were running Solaris on the server.

 
-- 

Michel TALON



Re: NFS Locking Issue

2006-07-03 Thread User Freebsd

On Mon, 3 Jul 2006, Francisco Reyes wrote:


Kostik Belousov writes:


I think that then 6.2 and 6.3 are not for you either. Problems
cannot be fixed until enough information is given.


I am trying.. but so far only other users who are having the same problem are 
commenting on this and other similar threads.


We just need some guidance..

Mark gave me a URL to turn on debugging and volunteered to give me some 
pointers.. I will try, but I will likely try on my own time, on my own 
machines.. I cannot tell the owner of the company I work for to let me 
try.. or play around on production machines.. as we lose customers 
because of current problems with the 6.X line. 

Since nobody except you experiences these problems (at least, only you 
reported that the problem exists)


Did you miss the part of:


User Freebsd writes:

Since there are several of us experiencing what looks to be the same sort
of deadlock issue, I beseech you not to give up


I am not the only one reporting or having the issue.


Careful here, I think this is where things are getting confused ... the 
above is related to the deadlock (high vmstat blockd issue), not the NFS 
issue ... we're getting two different issues confused :)



improved handling of signals in nfs client. If you could test it, that
would be useful.


Does it matter if the OS is i386 or amd64?
Have an amd64 machine I can more easily play with... with no risk to 
production.


Does the amd64 machine exhibit the same problem?


Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email . [EMAIL PROTECTED]  MSN . [EMAIL PROTECTED]
Yahoo . yscrappy   Skype: hub.org   ICQ . 7615664


Re: NFS Locking Issue

2006-07-03 Thread Chuck Swiger

Michel Talon wrote:
[ ...a long email snipped... ]

My only conclusion is that these NFS stories are very
tricky. The only moment everything worked fine was when we were running
Solaris on the server.


I can't speak to the earlier part about NFS with Linux, but at least I very 
much agree with your conclusion: Solaris makes one of the best NFS servers 
available, over a broad range of use cases.


However, I also wish to note that if you want to use NFS and you need remote 
locking to work, your best hope is when the software you use is willing to use 
explicit lockfiles rather than depending on rpc.lockd to provide remote 
flock()/lockf()-style locking.


There is plenty of software out there which includes locking tests (sendmail 
does, UWash IMAP does, Perl does, etc.), and my observation has been that 
actually using NFS-based remote locking under anything beyond trivial load 
tends to make rpc.lockd terminate within seconds (maybe with a core dump, if 
you get lucky), or leave processes stuck forever waiting on 
locks that never return because they've been lost somewhere in limbo.


YMMV.  :-)
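
A minimal sketch of the explicit-lockfile approach mentioned above, assuming the lock lives in a directory that every client mounts. It relies on ln(1) refusing to create a hard link when the target name already exists; acquire_lock and release_lock are hypothetical helper names, and the demo lock path is a placeholder.

```shell
#!/bin/sh
# Explicit lockfile built on hard links: ln(1) fails if the target name
# already exists, so at most one caller can create the lock.
# (acquire_lock and release_lock are hypothetical helper names.)
acquire_lock() {
    lock=$1
    # A name unique to this host and process, on the same filesystem.
    tmp="$lock.$(uname -n).$$"
    echo "$$" > "$tmp" || return 2
    if ln "$tmp" "$lock" 2>/dev/null; then
        rm -f "$tmp"
        return 0            # we hold the lock
    fi
    rm -f "$tmp"
    return 1                # someone else holds it
}

release_lock() {
    rm -f "$1"
}

# Usage: guard a critical section with a lockfile.
lockfile="${TMPDIR:-/tmp}/demo.$$.lock"
if acquire_lock "$lockfile"; then
    echo "lock acquired"
    release_lock "$lockfile"
else
    echo "lock busy"
fi
```

This does not involve rpc.lockd at all, which is the point. It leaves stale locks behind if the holder crashes, and more careful implementations also check the link count of the temporary file, since an NFS reply can be lost even when the link was created.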

--
-Chuck


Re: Which FreeBSD is the most stable for Dell PowerEdge 2850

2006-07-03 Thread Vivek Khera


On Jun 30, 2006, at 8:08 PM, Dan Charrois wrote:

In any case, the server is used heavily all year except July, so  
this is my time of year to take things apart, update software,  
etc.  And so I'm wondering - what is the recommended version of  
FreeBSD I should be running if stability is of the utmost  
importance?  Should I migrate to the 6.x stream?  Is it relatively  
solid?  Or should I stay with 5.4 for now?  I've seen some messages  
posted periodically from various people running into problems,


I don't have any 2850s, but the 1850 I have has been running 6.0
since BETA1, and last night I upgraded it to 6.1.  No issues.
The PERC 4e/Si card is phenomenally fast on this system (running a 2
disk RAID1).  I'd recommend running 6.1, as it is stable on all of
my Dell systems that run it (and I'm migrating the older FreeBSD
boxes to 6.1 as time permits).


If you already have more than 1 CPU, you might as well leave hyperthreading
off.  There are cases where it degrades performance rather than
enhancing it.


As for mysql version, no comment :-)



Re: NFS Locking Issue

2006-07-03 Thread Garance A Drosihn

At 9:13 PM -0400 7/1/06, Francisco Reyes wrote:

John Hay writes:


I only started to see the lockd problems when upgrading
the server side to FreeBSD 6.x and later. I had various
FreeBSD clients, between 4.x and 7-current and the lockd
problem only showed up when upgrading the server from
5.x to 6.x.


It confirms what we are experiencing.. constant
freezing/locking issues.  I guess no more 6.X for us.. for
the foreseeable future..


I don't know if this will be of any help to anyone,
but...

I recently moved a network-based service from a 4.x machine
to a 6.x machine.  Despite some testing in advance of the
switch, many people had problems with the service.  I booted
to a somewhat out-of-date snapshot of 5.x on the same box.
I still had problems, but it didn't seem as bad, so I stuck
with the 5.x system.  Some problems turned out to be bugs
in the service itself, and were eventually found and fixed.

However, one set of problems on that out-of-date snapshot
of 5.x were solved by adding:

net.inet.tcp.rfc1323=0

to /etc/sysctl.conf.  The guy who suggested that said it
avoided a bug which was fixed in later versions of either
5.x or 6.x, I forget which.  Of interest is that the bug
was such that some people connecting to the service were
never bothered by the bug, while other people could not use
the service at all until I turned off tcp.rfc1323 .

I have a test version of the same service running on a
different FreeBSD/i386 box, and that box is now updated
to freebsd-stable as of June 10th.  Lo and behold, someone
connecting to that test box reported some problems.  So I
typed in 'sysctl net.inet.tcp.rfc1323=0', and his problem
immediately disappeared.  So, it might be that there is
still some problem with the rfc1323 processing, or that the
bug which had been fixed has somehow been re-introduced.

In any case, people who are experiencing problems with NFS
might want to try that, and see if it makes any difference.
It does strike me as odd that some people are having a *lot*
of trouble with NFS under 6.x, while others seem to be okay
with it.  Perhaps the difference is the network topology
between the NFS server and the NFS clients.
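
For anyone who wants to try this, a small sketch of the steps follows. net.inet.tcp.rfc1323 controls the RFC 1323 TCP extensions (window scaling and timestamps). ensure_sysctl_line is a hypothetical helper that adds the setting to a sysctl.conf-style file only if that exact line is missing; the sysctl commands themselves are left as comments because they need root on the affected machine.

```shell
#!/bin/sh
# Append a "name=value" line to a sysctl.conf-style file unless that exact
# line is already present, so repeated runs do not duplicate it.
# (ensure_sysctl_line is a hypothetical helper name.)
ensure_sysctl_line() {
    conf=$1
    line=$2
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}

# On the affected machine (as root) the experiment would look like:
#   sysctl net.inet.tcp.rfc1323        # show the current value
#   sysctl net.inet.tcp.rfc1323=0      # turn it off until the next reboot
#   ensure_sysctl_line /etc/sysctl.conf net.inet.tcp.rfc1323=0
```

Trying the runtime change first and only persisting it once it is shown to help keeps the experiment easy to undo.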

Obviously, this is nothing but a guess on my part.  I am
not a networking guru!

--
Garance Alistair Drosehn            =   [EMAIL PROTECTED]
Senior Systems Programmer           or  [EMAIL PROTECTED]
Rensselaer Polytechnic Institute    or  [EMAIL PROTECTED]


Re: NFS Locking Issue

2006-07-03 Thread Michael Collette

Garance A Drosihn wrote:

At 9:13 PM -0400 7/1/06, Francisco Reyes wrote:

John Hay writes:


I only started to see the lockd problems when upgrading
the server side to FreeBSD 6.x and later. I had various
FreeBSD clients, between 4.x and 7-current and the lockd
problem only showed up when upgrading the server from
5.x to 6.x.


It confirms what we are experiencing.. constant
freezing/locking issues.  I guess no more 6.X for us.. for
the foreseeable future..


I don't know if this will be of any help to anyone,
but...

I recently moved a network-based service from a 4.x machine
to a 6.x machine.  Despite some testing in advance of the
switch, many people had problems with the service.  I booted
to a somewhat out-of-date snapshot of 5.x on the same box.
I still had problems, but it didn't seem as bad, so I stuck
with the 5.x system.  Some problems turned out to be bugs
in the service itself, and were eventually found and fixed.

However, one set of problems on that out-of-date snapshot
of 5.x were solved by adding:

net.inet.tcp.rfc1323=0

to /etc/sysctl.conf.  The guy who suggested that said it
avoided a bug which was fixed in later versions of either
5.x or 6.x, I forget which.  Of interest is that the bug
was such that some people connecting to the service were
never bothered by the bug, while other people could not use
the service at all until I turned off tcp.rfc1323 .

I have a test version of the same service running on a
different FreeBSD/i386 box, and that box is now updated
to freebsd-stable as of June 10th.  Lo and behold, someone
connecting to that test box reported some problems.  So I
typed in 'sysctl net.inet.tcp.rfc1323=0', and his problem
immediately disappeared.  So, it might be that there is
still some problem with the rfc1323 processing, or that the
bug which had been fixed has somehow been re-introduced.

In any case, people who are experiencing problems with NFS
might want to try that, and see if it makes any difference.
It does strike me as odd that some people are having a *lot*
of trouble with NFS under 6.x, while others seem to be okay
with it.  Perhaps the difference is the network topology
between the NFS server and the NFS clients.

Obviously, this is nothing but a guess on my part.  I am
not a networking guru!



Thanks for the try Garance, but in my setup it didn't make any 
difference.  I'll get into a bit more detail about my setup in another post.


Later on,
--
Michael Collette
IT Manager
TestEquity Inc
[EMAIL PROTECTED]


Re: trap 12: supervisor write, page not present on 6.1-STABLE Tue May 16 2006

2006-07-03 Thread Stanislaw Halik
On Fri, Jun 30, 2006, Robert Watson wrote:
 Thanks for testing the patch -- it looks like there's a more pressing 
 logical problem in this code!  Could you try the following simpler patch:

 http://www.watson.org/~robert/freebsd/netperf/ip_ctloutput.diff

 The IP option code seems not to know that (in RELENG_6 and before) the pcb 
 is discarded on disconnect, and the application is querying the TTL after a 
 disconnect.  In FreeBSD 7.x, the pcb is preserved after disconnect so this 
 succeeds.

I've been running with the patch applied for 3 days straight and the machine
hasn't crashed once. Please consider merging it to RELENG_6.




Re: NFS Locking Issue

2006-07-03 Thread Michael Collette

User Freebsd wrote:

On Sat, 1 Jul 2006, Francisco Reyes wrote:


John Hay writes:


I only started to see the lockd problems when upgrading the server side
to FreeBSD 6.x and later. I had various FreeBSD clients, between 4.x
and 7-current and the lockd problem only showed up when upgrading the
server from 5.x to 6.x.


It confirms what we are experiencing.. constant freezing/locking 
issues.

I guess no more 6.X for us.. for the foreseeable future..


Since there are several of us experiencing what looks to be the same 
sort of deadlock issue, I beseech you not to give up


Honestly trying not to.  To tell ya the truth, I've been giving a real 
hard look at Ubuntu for my serving needs.  This NFS thing has got me 
seriously questioning FreeBSD right at the moment.


... right now, all 
we've been able to get to the developers is virtually useless 
information (vmstat and such shows the problem, but it doesn't allow 
developers to identify the problem) ...


Is this a problem that you can easily recreate, even on a non-production 
machine?


Oh yeah.  I've got a couple of ways I'm able to get this to fail.

Method #1:
----------
Let's start with the simplest.  The scenario here involves 2 machines, 
mach01 and mach02.  Both are running 6-STABLE, and both are running 
rpcbind, rpc.statd, and rpc.lockd.  mach01 has exported /documents and 
mach02 is mounting that export under /mnt.  Simple enough?


The /documents directory has multiple subdirectories and files of 
various sizes.  The actual amount of data doesn't really matter to 
produce a failure.  All you need to do at this point is to try to copy 
files from that mount point to somewhere else on the hard drive.


cp -Rp /mnt/* /tmp/documents/

You may or may not see that a couple of subdirectories were created, but 
no files are actually copied over.  The cp command is now locked up, and no 
traffic moves.  This usually takes a second or two to show up as a 
problem.  I can repeat this with multiple 6-STABLE boxes.


Turn off rpc.lockd on either the server or client before the cp command, 
and things work.
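
The Method #1 test can be wrapped so that a hang is reported instead of tying up the terminal. This is a sketch only: copy_with_deadline is a hypothetical helper, the 30 second limit in the usage comment is an arbitrary choice, and the paths are the ones from the description above.

```shell
#!/bin/sh
# Run a recursive copy in the background and report whether it finished
# within a deadline (in seconds). A hang like the one described shows up
# as "STUCK" instead of a shell that never returns.
# (copy_with_deadline is a hypothetical helper name.)
copy_with_deadline() {
    src=$1
    dst=$2
    limit=$3
    status_file="${TMPDIR:-/tmp}/cpstatus.$$"
    rm -f "$status_file"
    mkdir -p "$dst" || return 2
    # The subshell records cp's exit status when (if) cp returns.
    ( cp -Rp "$src"/. "$dst"/; echo $? > "$status_file" ) &
    waited=0
    while [ ! -s "$status_file" ]; do
        if [ "$waited" -ge "$limit" ]; then
            echo "STUCK: cp has not returned after ${limit}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    status=$(cat "$status_file")
    rm -f "$status_file"
    [ "$status" -eq 0 ] && echo "OK: copy finished in about ${waited}s"
}

# Usage on the client, once with rpc.lockd running and once without:
#   copy_with_deadline /mnt /tmp/documents 30
```

The script deliberately does not try to kill a stuck cp, since a process blocked on NFS may not respond to signals; it only reports the state so the two runs can be compared.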


Method #2:
----------
Booting to a diskless work station.  The server (mach01) has exported 
/usr, /usr/local, /usr/X11R6 and enough other stuff to get a diskless 
workstation up and running.  Not going to get into all the details here 
other than to say that I have a fully functioning setup like this on 5.4 
boxes now.


I've cut the boot of the diskless client (mach02) down to console 
only.  Once at the console I run startx as a regular user, which takes me into 
twm.  From there I try to launch a KDE application, which in my test 
case is kwrite.  The same happens when launching a GTK app, 
such as Gimp.


X and twm start up.  I've got all the rest of the system reasonably 
functional.  When I try to run kwrite, none of the KDE subsystems start 
up.  kwrite just sits there in a lockd state.  Same is true of Gimp.


If I shut down rpc.lockd on either machine I'm able to bring up a full 
KDE desktop, with all applications able to run.


Other Testing:
--------------
At one point we had in our test network a 6.1 NFS server providing files 
to 5.4 diskless clients without any problems.  We first got to noticing 
the bulk of the glitches when I moved the diskless setup to use a 6.1 
kernel.


As I said, I've been looking at Linux alternatives.  Especially after 
reading about Michel Talon's experiences with Fedora.  I initially tried 
CentOS, but wasn't able to get NFS working properly on that thing.  I 
had an Ubuntu CD handy, so I installed it on a test box.  Wow, does that 
NFS server boogie!


Using Ubuntu as the server I connected FreeBSD 5.4 and 6-stable boxes as 
clients on a 100Mb/s network.  The time trial used a dummy 100Meg file 
transferred from the server to the client.  We measured 90Mb/s transfer, 
which was FAR faster than I had ever been able to get 2 FreeBSD boxes to 
perform in similar tests.
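
A time trial like this can be repeated with two small helpers. The sketch assumes that "100Meg" means 100 MiB and that Mb/s means 10^6 bits per second; make_dummy_file and mbits_per_sec are hypothetical helper names, and the 10 second figure is only an example input, not a measurement from this thread.

```shell
#!/bin/sh
# Create a dummy file of the given size in MiB, filled with zeros.
# (make_dummy_file and mbits_per_sec are hypothetical helper names.)
make_dummy_file() {
    path=$1
    mib=$2
    dd if=/dev/zero of="$path" bs=1048576 count="$mib" 2>/dev/null
}

# Convert "bytes moved in N seconds" into megabits per second.
mbits_per_sec() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b * 8 / s / 1000000 }'
}

# Example: a 100 MiB file (104857600 bytes) that took 10 seconds to copy.
mbits_per_sec 104857600 10    # prints 83.9

# On the server:  make_dummy_file /documents/dummy.bin 100
# On the client:  time cp /mnt/dummy.bin /tmp/
```

Under the same assumptions, the 90Mb/s reported above corresponds to a 100 MiB file arriving in a little over 9 seconds.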


I then used Ubuntu to connect to a 5.4 server we have in production.  I 
don't recall the exact stats, but it was close to 10x slower.  No 
lockups here though.


After the 4th of July I intend to test Ubuntu as a client to a FreeBSD 
6-STABLE server on a gigabit LAN to run similar time trials.  I'm 
looking to confirm what I can only suspect at this point, which is that 
the NFS server on FreeBSD is mucked up, but the client is okay.


As time allows I hope to run similar tests between two Ubuntu boxes, 
then run it all again with Fedora.  Seriously debating whether to move 
some or all of our infrastructure to Linux after all this.  A 3-4 month 
old known bug like this gives me a great deal of concern about FreeBSD. 
 That, and Ubuntu's NFS server speed just about knocked me over!


 In my case, I have one machine fully configured for debugging, 
but, of 

Re: trap 12: supervisor write, page not present on 6.1-STABLE Tue May 16 2006

2006-07-03 Thread Robert Watson


On Tue, 4 Jul 2006, Stanislaw Halik wrote:


On Fri, Jun 30, 2006, Robert Watson wrote:
Thanks for testing the patch -- it looks like there's a more pressing 
logical problem in this code!  Could you try the following simpler patch:



http://www.watson.org/~robert/freebsd/netperf/ip_ctloutput.diff


The IP option code seems not to know that (in RELENG_6 and before) the pcb 
is discarded on disconnect, and the application is querying the TTL after a 
disconnect.  In FreeBSD 7.x, the pcb is preserved after disconnect so this 
succeeds.


I've been running with the patch applied for 3 days straight and the machine 
hasn't crashed once. Please consider merging it to RELENG_6.


I have committed this as ip_output.c:1.242.2.9 in the RELENG_6 branch, and 
will also merge to RELENG_5 in a few days.  Assuming this settles well, I'll 
talk to the RE team about doing an errata patch for this in the RELENG_6_1 
branch.


Thanks!

Robert N M Watson
Computer Laboratory
University of Cambridge