Re: NFS Locking Issue
On Mon, 3 Jul 2006, Kostik Belousov wrote:

On Mon, Jul 03, 2006 at 12:50:11AM -0400, Francisco Reyes wrote:

Kostik Belousov writes:

Since nobody except you experience that problems (at least, only you notified about the problem existence)

Did you miss the part of:

User Freebsd writes:

Since there are several of us experiencing what looks to be the same sort of deadlock issue, I beseech you not to give up

I am not the only one reporting or having the issue.

I think you have different issues.

I agree. It looks like we have several issues floating around. There are some known issues with rpc.lockd (and probably some unknown ones) that will require a concerted effort to resolve. There appear to be a number of reports relating to this/these problems. It sounds like there is also an NFS client race condition or other bug of some sort.

I think it would be really useful to isolate the two during debugging. Specifically, to make sure that the second client bug is reproducible without rpc.lockd running on the client (and related mount flags).

Once we have some more information, such as vnode locking information, client thread stack traces, etc, we should probably get Mohan in the loop if things seem sticky. I believe he was on vacation last week; he may be back this week sometime. With the July 4 weekend afoot, a lot of .us developers are offline.

Robert N M Watson
Computer Laboratory
University of Cambridge
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: FreeBSD 6.1 Tor issues (Once More, with Feeling)
Dan Nelson wrote:

In the last episode (Jul 02), Robert Watson said:

On Sun, 2 Jul 2006, Fabian Keil wrote:

The ssh man page offers:

|~B Send a BREAK to the remote system (only useful for SSH
|protocol version 2 and if the peer supports it).

I am using ssh 2, but the only reaction I get is a new line.

|FreeBSD/i386 (tor.fabiankeil.de) (ttyd0)
|
|login: ~B

If you enter ~B and actually see a ~B printed to the screen, then ssh didn't process it because you didn't hit <cr> first. So <cr>~B will tell ssh to send a break.

I am actually using <cr>~B and I don't see just ~B, but ~B . The tilde is printed after I release B, therefore I guess it is working.

It sounds like your serial console server may not know how to map SSH break signals into remote serial break signals. Try ALT_BREAK_TO_DEBUGGER. Here's the description from NOTES:

# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console.
options ALT_BREAK_TO_DEBUGGER

... and if you're sshing to your terminal server, remember that ssh will eat that tilde (because you sent <cr>~ ), so you need to send <cr>~~^B to pass the right characters to FreeBSD. Or change ssh's escape character with the -e flag.

<cr>~^b works for me, without touching any ssh settings. As <cr>~. is still causing a disconnect, it doesn't look like the escape character was changed either.

Fabian
--
http://www.fabiankeil.de/
Re: FreeBSD 6.1 Tor issues (Once More, with Feeling)
Fabian Keil wrote:

Robert Watson wrote:

It sounds like your serial console server may not know how to map SSH break signals into remote serial break signals. Try ALT_BREAK_TO_DEBUGGER. Here's the description from NOTES:

# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console.
options ALT_BREAK_TO_DEBUGGER

It took me several attempts to get the character sequence right, but yes, this one works. Thanks.

Unfortunately it didn't work while the system was hanging this morning. I wasn't logged in at the console before the hang occurred, so it may be that the terminal server checked the console for life signs, found none, and neither connected nor printed a warning (wild guess; I have no idea if it does that).

It could also mean that I'm seeing the mysterious power off part described in: http://www.freebsd.org/cgi/query-pr.cgi?pr=95180 but I have no way to tell the difference.

I will stay connected to the console until the system hangs again to see if it changes anything.

Fabian
--
http://www.fabiankeil.de/
Re: NFS Locking Issue
On Mon, Jul 03, 2006 at 10:06:52AM +0100, Robert Watson wrote:

On Mon, 3 Jul 2006, Kostik Belousov wrote:

On Mon, Jul 03, 2006 at 12:50:11AM -0400, Francisco Reyes wrote:

Kostik Belousov writes:

Since nobody except you experience that problems (at least, only you notified about the problem existence)

Did you miss the part of:

User Freebsd writes:

Since there are several of us experiencing what looks to be the same sort of deadlock issue, I beseech you not to give up

I am not the only one reporting or having the issue.

I think you have different issues.

I agree. It looks like we have several issues floating around. There are some known issues with rpc.lockd (and probably some unknown ones) that will require a concerted effort to resolve. There appear to be a number of reports relating to this/these problems. It sounds like there is also an NFS client race condition or other bug of some sort.

I think it would be really useful to isolate the two during debugging. Specifically, to make sure that the second client bug is reproducible without rpc.lockd running on the client (and related mount flags).

Once we have some more information, such as vnode locking information, client thread stack traces, etc, we should probably get Mohan in the loop if things seem sticky. I believe he was on vacation last week; he may be back this week sometime. With the July 4 weekend afoot, a lot of .us developers are offline.

I too noted some time ago that an unresponsive NFS server takes the NFS client down. I then looked at the issue, and have the impression that this is again a case of runningbufspace depletion. I got a lot of processes in the wdrain and flswai states. After the NFS server was repaired, the active write requests were executed, the number of dirty buffers decreased, and the system returned to normal operation.

This seems to be an architectural issue. I tried to bring the discussion up several months ago, but got no response.

And there is the small problem of SIGINT being ignored when mounted with the intr flag. A patch to fix this was attached to my previous mail.
Re: NFS Locking Issue
On Mon, Jul 03, 2006 at 10:06:52AM +0100, Robert Watson wrote:

It sounds like there is also an NFS client race condition or other bug of some sort.

It may not be related, directly, but one thing that I noticed, while trying to sort out my own recently commissioned NFS setup, is that the -r1024 mount flag is *crucial* when the network is 100BaseT and the server is a new, fast amd64 box, and the client is an old P3-500 with a RealTek ethernet card. It works fine now, but tcpdump showed that it was retrying forever without it. Even NFS over TCP seemed to suffer a bunch of error-related retries which amounted to stalls in the client.

Is there any way for this sort of thing to be adjusted automatically?

Cheers,
--
Andrew
Re: NFS Locking Issue
So it would appear that you cured the NFS problems inherent with FBSD-6 by replacing FBSD with Fedora Linux. Nice to know that NFSd works in Linux. But won't help those on the FBSD list fix their FBSD-6 boxen. :/

First, NFS is designed to make machines of different OSs interact properly. If a FreeBSD server interacts properly with a FreeBSD client, but not other clients, you cannot say that the situation is fine. Second, I am not the one to choose the NFS server; there are people working in social groups, in the real world. And third, the most important, the OP message seemed to imply that the FreeBSD-6 NFS client was at fault; I pointed out that in my experience my FreeBSD-6.1 client works OK, while the 6.0 doesn't, when interacting with a FC5 server. This is in itself a relevant piece of information for the problem at hand. It may be that the server side is at fault, or some complex interaction between client and server.

Anyway, some people claimed here that they had no problem with FreeBSD-5 clients and servers. My experience is that I had constant problems between FreeBSD-5 clients and Fedora Core 3 servers. I cannot provide any other data point. I am not particularly sure of the quality of the FC3 or FC5 NFS server implementation, except that the ~100 workstations running the similar Fedora distribution work like a charm with their home directories NFS-mounted on the server. On the other hand, a Debian client machine also has severe NFS problems.

My only conclusion is that these NFS stories are very tricky. The only moment everything worked fine was when we were running Solaris on the server.

--
Michel TALON
Re: NFS Locking Issue
On Mon, 3 Jul 2006, Francisco Reyes wrote:

Kostik Belousov writes:

I think that then 6.2 and 6.3 is not for you either. Problems cannot be fixed until enough information is given.

I am trying.. but so far only other users who are having the same problem are commenting on this and other similar threads.

We just need some guidance..

Mark gave me a URL to turn on debugging and volunteered to give me some pointers.. I will try, but I will likely try on my own time, on my own machines.. I can not tell the owner of the company I work for to let me try.. or play around in production machines.. as we lose customers because of current problems with the 6.X line.

Since nobody except you experience that problems (at least, only you notified about the problem existence)

Did you miss the part of:

User Freebsd writes:

Since there are several of us experiencing what looks to be the same sort of deadlock issue, I beseech you not to give up

I am not the only one reporting or having the issue.

Careful here, I think this is where things are getting confused ... the above is related to the deadlock (high vmstat blockd issue), not the NFS issue ... we're getting two different issues confused :)

improved handling of signals in nfs client. If you could test it, that would be useful.

Does it matter if the OS is i386 or amd64? Have an amd64 machine I can more easily play with... with no risk to production.

Does the amd64 machine exhibit the same problem?

Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Yahoo . yscrappy Skype: hub.org ICQ . 7615664
Re: NFS Locking Issue
Michel Talon wrote:

[ ...a long email snipped... ]

My only conclusion is that these NFS stories are very tricky. The only moment everything worked fine was when we were running Solaris on the server.

I can't speak to the earlier part about NFS with Linux, but at least I very much agree with your conclusion: Solaris makes one of the best NFS servers available, over a broad range of use cases.

However, I also wish to note that if you want to use NFS and you need remote locking to work, your best hope is when the software you use is willing to use explicit lockfiles rather than depending on rpc.lockd to provide remote flock()/lockf()-style locking.

There is plenty of software out there which includes locking tests (sendmail does, UWash IMAP does, Perl does, etc), and my observation has been that actually using NFS-based remote locking under anything beyond trivial load tends to make rpc.lockd terminate within seconds (maybe with a core dump, if you get lucky), or end up with processes getting stuck forever waiting on locks that don't ever return because they've been lost somewhere in limbo.

YMMV. :-)

--
-Chuck
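As a rough illustration of the "explicit lockfiles" approach mentioned above, here is a minimal sh sketch. It is not taken from sendmail, UWash IMAP, or any other program named in the thread; the function names acquire_lock and release_lock and the lock path are hypothetical. It relies on ln(1) refusing to replace an existing name, so that only one contender ends up owning the lockfile, and it never talks to rpc.lockd.

```shell
#!/bin/sh
# Hypothetical helpers: lockfile-based locking that does not involve rpc.lockd.
# Variables are global because POSIX sh has no "local".

# acquire_lock LOCKFILE: returns 0 if we now hold the lock, 1 otherwise.
acquire_lock() {
    lock=$1
    # Unique per-host, per-process name in the same directory, so the
    # hard link below stays inside one filesystem.
    tmp="$lock.$(uname -n).$$"
    echo "$$" > "$tmp" || return 1
    # Without -f, ln fails if LOCKFILE already exists, so at most one
    # contender succeeds in creating the link.
    if ln "$tmp" "$lock" 2>/dev/null; then
        rm -f "$tmp"
        return 0
    fi
    rm -f "$tmp"
    return 1
}

# release_lock LOCKFILE: drop the lock by removing the file.
release_lock() {
    rm -f "$1"
}
```

A caller would wrap its critical section as `acquire_lock /path/to/app.lock && { do_work; release_lock /path/to/app.lock; }`. The sketch does not handle stale locks left behind by a crashed holder, and careful implementations also check the link count of the temporary file rather than trusting the exit status alone, since a reply can be lost over NFS.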
Re: Which FreeBSD is the most stable for Dell PowerEdge 2850
On Jun 30, 2006, at 8:08 PM, Dan Charrois wrote:

In any case, the server is used heavily all year except July, so this is my time of year to take things apart, update software, etc. And so I'm wondering - what is the recommended version of FreeBSD I should be running if stability is of the utmost importance? Should I migrate to the 6.x stream? Is it relatively solid? Or should I stay with 5.4 for now? I've seen some messages posted periodically from various people running into problems,

I don't have any 2850's but the 1850 I have has been running 6.0 since the BETA1, and last night just upgraded it to 6.1. No issues. The PERC 4e/Si card is phenomenally fast on this system (running 2 disk RAID1).

I'd recommend you run 6.1, as it is stable on all of my Dell systems that run it (and I'm migrating the older FreeBSD boxes to 6.1 as time permits).

If you already have 1 CPU, you might as well leave hyperthreading off. There are cases where it degrades performance rather than enhancing it.

As for mysql version, no comment :-)
Re: NFS Locking Issue
At 9:13 PM -0400 7/1/06, Francisco Reyes wrote:

John Hay writes:

I only started to see the lockd problems when upgrading the server side to FreeBSD 6.x and later. I had various FreeBSD clients, between 4.x and 7-current and the lockd problem only showed up when upgrading the server from 5.x to 6.x.

It confirms the same we are experiencing.. constant freezing/locking issues. I guess no more 6.X for us.. for the foreseeable future..

I don't know if this will be of any help to anyone, but...

I recently moved a network-based service from a 4.x machine to a 6.x machine. Despite some testing in advance of the switch, many people had problems with the service. I booted to a somewhat out-of-date snapshot of 5.x on the same box. I still had problems, but it didn't seem as bad, so I stuck with the 5.x system. Some problems turned out to be bugs in the service itself, and were eventually found and fixed.

However, one set of problems on that out-of-date snapshot of 5.x was solved by adding:

net.inet.tcp.rfc1323=0

to /etc/sysctl.conf. The guy who suggested that said it avoided a bug which was fixed in later versions of either 5.x or 6.x, I forget which. Of interest is that the bug was such that some people connecting to the service were never bothered by the bug, while other people could not use the service at all until I turned off tcp.rfc1323.

I have a test version of the same service running on a different FreeBSD/i386 box, and that box is now updated to freebsd-stable as of June 10th. Lo and behold, someone connecting to that test box reported some problems. So I typed in 'sysctl net.inet.tcp.rfc1323=0', and his problem immediately disappeared.

So, it might be that there is still some problem with the rfc1323 processing, or that the bug which had been fixed has somehow been re-introduced. In any case, people who are experiencing problems with NFS might want to try that, and see if it makes any difference.
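For anyone who wants to try the rfc1323 suggestion above in a repeatable way, the two steps (runtime change, then persistence) can be sketched in sh as follows. The sysctl name and the /etc/sysctl.conf path come from the message; the function names are hypothetical, and the FreeBSD check is only there so that the snippet does nothing harmful if pasted on another system.

```shell
#!/bin/sh
# Sketch only: function names are made up for illustration.

# Turn RFC 1323 TCP extensions off on the running system (needs root).
rfc1323_off_now() {
    if [ "$(uname -s)" != "FreeBSD" ]; then
        echo "not FreeBSD; net.inet.tcp.rfc1323 does not apply here" >&2
        return 1
    fi
    sysctl net.inet.tcp.rfc1323=0
}

# Keep the setting across reboots. Appends the line only if no rfc1323
# entry exists yet. An optional argument overrides the file path.
rfc1323_off_at_boot() {
    conf=${1:-/etc/sysctl.conf}
    if ! grep -q '^net\.inet\.tcp\.rfc1323=' "$conf" 2>/dev/null; then
        echo 'net.inet.tcp.rfc1323=0' >> "$conf"
    fi
}
```

Running `rfc1323_off_now` first and only calling `rfc1323_off_at_boot` once the change is seen to help keeps the experiment easy to undo: `sysctl net.inet.tcp.rfc1323=1` restores the runtime value.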
It does strike me as odd that some people are having a *lot* of trouble with NFS under 6.x, while others seem to be okay with it. Perhaps the difference is the network topology between the NFS server and the NFS clients.

Obviously, this is nothing but a guess on my part. I am not a networking guru!

--
Garance Alistair Drosehn
Senior Systems Programmer
Rensselaer Polytechnic Institute
Re: NFS Locking Issue
Garance A Drosihn wrote:

At 9:13 PM -0400 7/1/06, Francisco Reyes wrote:

John Hay writes:

I only started to see the lockd problems when upgrading the server side to FreeBSD 6.x and later. I had various FreeBSD clients, between 4.x and 7-current and the lockd problem only showed up when upgrading the server from 5.x to 6.x.

It confirms the same we are experiencing.. constant freezing/locking issues. I guess no more 6.X for us.. for the foreseeable future..

I don't know if this will be of any help to anyone, but...

I recently moved a network-based service from a 4.x machine to a 6.x machine. Despite some testing in advance of the switch, many people had problems with the service. I booted to a somewhat out-of-date snapshot of 5.x on the same box. I still had problems, but it didn't seem as bad, so I stuck with the 5.x system. Some problems turned out to be bugs in the service itself, and were eventually found and fixed.

However, one set of problems on that out-of-date snapshot of 5.x was solved by adding:

net.inet.tcp.rfc1323=0

to /etc/sysctl.conf. The guy who suggested that said it avoided a bug which was fixed in later versions of either 5.x or 6.x, I forget which. Of interest is that the bug was such that some people connecting to the service were never bothered by the bug, while other people could not use the service at all until I turned off tcp.rfc1323.

I have a test version of the same service running on a different FreeBSD/i386 box, and that box is now updated to freebsd-stable as of June 10th. Lo and behold, someone connecting to that test box reported some problems. So I typed in 'sysctl net.inet.tcp.rfc1323=0', and his problem immediately disappeared.

So, it might be that there is still some problem with the rfc1323 processing, or that the bug which had been fixed has somehow been re-introduced. In any case, people who are experiencing problems with NFS might want to try that, and see if it makes any difference.
It does strike me as odd that some people are having a *lot* of trouble with NFS under 6.x, while others seem to be okay with it. Perhaps the difference is the network topology between the NFS server and the NFS clients.

Obviously, this is nothing but a guess on my part. I am not a networking guru!

Thanks for the try Garance, but in my setup it didn't make any difference. I'll get into a bit more detail about my setup in another post.

Later on,
--
Michael Collette
IT Manager
TestEquity Inc
Re: trap 12: supervisor write, page not present on 6.1-STABLE Tue May 16 2006
On Fri, Jun 30, 2006, Robert Watson wrote:

Thanks for testing the patch -- it looks like there's a more pressing logical problem in this code! Could you try the following simpler patch:

http://www.watson.org/~robert/freebsd/netperf/ip_ctloutput.diff

The IP option code seems not to know that (in RELENG_6 and before) the pcb is discarded on disconnect, and the application is querying the TTL after a disconnect. In FreeBSD 7.x, the pcb is preserved after disconnect so this succeeds.

I've been running with the patch applied for 3 days straight and the machine hasn't crashed once. Please consider merging it to RELENG_6.
Re: NFS Locking Issue
User Freebsd wrote:

On Sat, 1 Jul 2006, Francisco Reyes wrote:

John Hay writes:

I only started to see the lockd problems when upgrading the server side to FreeBSD 6.x and later. I had various FreeBSD clients, between 4.x and 7-current and the lockd problem only showed up when upgrading the server from 5.x to 6.x.

It confirms the same we are experiencing.. constant freezing/locking issues. I guess no more 6.X for us.. for the foreseeable future..

Since there are several of us experiencing what looks to be the same sort of deadlock issue, I beseech you not to give up

Honestly trying not to. To tell ya the truth, I've been giving a real hard look at Ubuntu for my serving needs. This NFS thing has got me seriously questioning FreeBSD right at the moment.

... right now, all we've been able to get to the developers is virtually useless information (vmstat and such shows the problem, but it doesn't allow developers to identify the problem) ...

Is this a problem that you can easily recreate, even on a non-production machine?

Oh yeah. I've got a couple of ways I'm able to get this to fail.

Method #1:

Let's start with the simplest. The scenario here involves 2 machines, mach01 and mach02. Both are running 6-STABLE, and both are running rpcbind, rpc.statd, and rpc.lockd. mach01 has exported /documents and mach02 is mounting that export under /mnt. Simple enough?

The /documents directory has multiple subdirectories and files of various sizes. The actual amount of data doesn't really matter to produce a failure. All you need to do at this point is to try to copy files from that mount point to somewhere else on the hard drive.

cp -Rp /mnt/* /tmp/documents/

You may, or may not, see that a couple of subdirectories were created, but no files actually moved over. The cp command is now locked up, and no traffic moves. This usually takes a second or two to show up as a problem. I can repeat this with multiple 6-STABLE boxes.
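Since the failure in Method #1 shows up as a cp that never returns, a test run is easier to automate if the copy is given a deadline. The following sh sketch is only one way to do that; the function name copy_or_report and the timeout are hypothetical, and the mount itself (mach01:/documents on /mnt, as described above) is assumed to be in place already.

```shell
#!/bin/sh
# Hypothetical helper: run the recursive copy from the report in the
# background and flag it if it has not finished within LIMIT seconds.

# copy_or_report SRC DST LIMIT
# Returns cp's exit status, or 1 if the copy looked hung.
copy_or_report() {
    src=$1 dst=$2 limit=$3
    mkdir -p "$dst" || return 2
    cp -Rp "$src"/* "$dst"/ &
    pid=$!
    waited=0
    # Poll once per second until cp exits or the deadline passes.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "cp still running after ${limit}s: possible lockd hang" >&2
            kill "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
}
```

On the client this would be called as `copy_or_report /mnt /tmp/documents 30`, once with rpc.lockd running and once without. A cp that is blocked inside the kernel on NFS may not respond to the kill, so a hung run can still leave a stuck process behind; the function only makes sure the test script itself moves on.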
Turn off rpc.lockd on either the server or client before the cp command, and things work.

Method #2:

Booting to a diskless work station. The server (mach01) has exported /usr, /usr/local, /usr/X11R6 and enough other stuff to get a diskless workstation up and running. Not going to get into all the details here other than to say that I have a fully functioning setup like this on 5.4 boxes now.

I've knocked the boot up of the diskless client (mach02) down to console only. Once at the console I startx with a regular user, taking me into twm. From there I try to launch a KDE application, which in my test case is kwrite. The same situation is true with launching a GTK app, such as Gimp.

X and twm start up. I've got all the rest of the system reasonably functional. When I try to run kwrite, none of the KDE subsystems start up. kwrite just sits there in a lockd state. Same is true of Gimp.

If I shut down rpc.lockd on either machine I'm able to bring up a full KDE desktop, with all applications able to run.

Other Testing:

At one point we had in our test network a 6.1 NFS server providing files to 5.4 diskless clients without any problems. We first noticed the bulk of the glitches when I moved the diskless setup to use a 6.1 kernel.

As I said, I've been looking at Linux alternatives. Especially after reading about Michel Talon's experiences with Fedora. I initially tried CentOS, but wasn't able to get NFS working properly on that thing. I had an Ubuntu CD handy, so I installed it on a test box. Wow, does that NFS server boogie!

Using Ubuntu as the server I connected a FreeBSD 5.4 and 6-stable box as clients on a 100Mb/s network. The time trial used a dummy 100Meg file transferred from the server to the client. We measured 90Mb/s transfer, which was FAR faster than I had ever been able to get 2 FreeBSD boxes to perform doing similar tests.

I then used Ubuntu to connect to a 5.4 server we have in production.
I don't recall the exact stats, but it was close to 10x slower. No lockups here though.

After the 4th of July I intend to test Ubuntu as a client to a FreeBSD 6-STABLE server on a gigabit LAN to run similar time trials. I'm looking to confirm what I can only suspect at this point, which is that the NFS server on FreeBSD is mucked up, but the client is okay.

As time allows I hope to run similar tests between two Ubuntu boxes, then run it all again with Fedora. Seriously debating whether to move some or all of our infrastructure to Linux after all this. A 3-4 month old known bug like this gives me a great deal of concern about FreeBSD. That, and Ubuntu's NFS server speed just about knocked me over!

In my case, I have one machine fully configured for debugging, but, of
Re: trap 12: supervisor write, page not present on 6.1-STABLE Tue May 16 2006
On Tue, 4 Jul 2006, Stanislaw Halik wrote:

On Fri, Jun 30, 2006, Robert Watson wrote:

Thanks for testing the patch -- it looks like there's a more pressing logical problem in this code! Could you try the following simpler patch:

http://www.watson.org/~robert/freebsd/netperf/ip_ctloutput.diff

The IP option code seems not to know that (in RELENG_6 and before) the pcb is discarded on disconnect, and the application is querying the TTL after a disconnect. In FreeBSD 7.x, the pcb is preserved after disconnect so this succeeds.

I've been running with the patch applied for 3 days straight and the machine hasn't crashed once. Please consider merging it to RELENG_6.

I have committed this as ip_output.c:1.242.2.9 in the RELENG_6 branch, and will also merge to RELENG_5 in a few days. Assuming this settles well, I'll talk to the RE team about doing an errata patch for this in the RELENG_6_1 branch.

Thanks!

Robert N M Watson
Computer Laboratory
University of Cambridge