Steve Sizemore wrote:
> Thanks for the explanation. If I were a programmer, it would be very
> useful. As it is, it's still interesting. I have no way of judging the
> quality of the code in question, other than the empirical result that
> it works in most cases.

Well, then you are stuck with code that someone else wrote.
Hopefully that's not your problem, or you are in trouble.
8-).


> > > What is the result of running this locally on the NFS server and
> > > attempting to lock the underlying file?  If rpc.lockd is hanging onto a
> > > lock, running that perl script locally on the actual file (not an NFS
> > > mounted image of it) should also hang.
> >
> > That was my next question, as well: does it happen on a local FS
> > as well as an NFS FS?  Personally, I would *NOT* recommend running
> > it on the server, but mount a local FS on the client instead; the
> > less variables, the better.
> 
> Works fine on the "client" on a local file system. Works fine on the
> server.

OK, then it isn't an intra-program deadlock, which is something.

It could still be inter-program, but if it is, it's not going to
be easy to find; you will need to find someone who *is* a programmer.
FWIW, an inter-program deadlock happens when:

        Program 1       Program 2
        LOCK A
                        LOCK B
        LOCK B (Waiting for Program 2)
                        LOCK A (Waiting for Program 1 waiting for me)
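
To make the diagram concrete, here is a minimal sketch in C.  It
assumes byte 0 of a scratch file stands in for A and byte 1 for B;
the names try_lock() and would_deadlock() are made up for this
illustration.  It uses the non-blocking F_SETLK, so it reports the
reversal instead of hanging; with F_SETLKW each side would sit
waiting for the other (some kernels notice this and fail one of the
requests with EDEADLK).

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask for a write lock on one byte without waiting: 0 if granted, -1 if not. */
static int
try_lock( int fd, off_t off)
{
        struct flock    fl;

        memset( &fl, 0, sizeof( fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = off;
        fl.l_len = 1;
        return( fcntl( fd, F_SETLK, &fl));
}

/*
 * Replay the diagram with byte 0 as A and byte 1 as B.
 * Returns 1 if each program's second request is refused because the
 * other program still holds that byte, 0 if not, -1 on a setup error.
 */
int
would_deadlock( const char *path)
{
        int     fd, up[ 2], down[ 2], stuck = 0;
        char    c = 'X';
        pid_t   pid;

        assert( path != NULL);
        if( (fd = open( path, O_RDWR | O_CREAT, 0600)) == -1)
                return( -1);
        if( pipe( up) == -1 || pipe( down) == -1 ||
            try_lock( fd, 0) == -1) {           /* Program 1: LOCK A */
                close( fd);
                return( -1);
        }
        if( (pid = fork()) == -1) {
                close( fd);
                return( -1);
        }

        if( pid == 0) {
                /* Program 2: LOCK B, then ask for A */
                close( up[ 0]);
                close( down[ 1]);
                if( try_lock( fd, 1) == 0 && try_lock( fd, 0) == -1)
                        c = 'W';                /* would have to wait */
                if( write( up[ 1], &c, 1) == 1 &&
                    read( down[ 0], &c, 1) == 1)
                        _exit( 0);              /* held B until released */
                _exit( 1);
        }

        /* Program 1: once Program 2 has reported, ask for B */
        close( up[ 1]);
        if( read( up[ 0], &c, 1) == 1 && c == 'W' && try_lock( fd, 1) == -1)
                stuck = 1;
        if( write( down[ 1], "x", 1) != 1)
                stuck = -1;
        waitpid( pid, NULL, 0);
        close( up[ 0]);
        close( down[ 0]);
        close( down[ 1]);
        close( fd);
        return( stuck);
}
```

Calling would_deadlock() on a scratch file on a local filesystem
should return 1: neither side can get its second lock while the
other holds its first.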

> > On the other hand, this is clearly a deadlock that requires an
> > existing, conflicting lock -- IFF you are correct about the
> > delayed locking behaviour.
> 
> Not sure I understand this.

If someone didn't already have it locked, your lock request (the
kind that waits until the region can be locked) would not need to
wait: it would just be granted, and you wouldn't have the problem.
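
In code terms, a waiting lock request looks roughly like the sketch
below.  The function name lock_and_wait() is made up for this
example, and it assumes the descriptor was opened for writing.  On a
region nobody else has locked, the call returns at once.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask for a write lock on [start, start+len) and wait if necessary.
 * Returns 0 once the lock is held, -1 on error.
 */
int
lock_and_wait( int fd, off_t start, off_t len)
{
        struct flock    fl;

        assert( fd >= 0);
        memset( &fl, 0, sizeof( fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = start;
        fl.l_len = len;

        /* F_SETLKW only sleeps if another process holds a conflicting lock */
        return( fcntl( fd, F_SETLKW, &fl));
}
```

So a hang in this call means some other lock is already in the way.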


> > Does the failure occur with the same values in all cases in the
> > F_RSETLKW?  If so, I suggest you capture *all* locking packets on
> > your wire, and then find who is conflicting.  This may be a simple
> > lock order reversal (deadly embrace deadlock) due to poor application
> > performance.  You may also find that you have multiple process IDs,
> > when it should be a single process ID, for the proxy PID for the
> > conflicting request.  At worst, it would be nice to know the system
> > that caused it.
> >
> > Actually, for a lock you know is there, you *can* diagnose the
> > problem (somewhat) by writing a program on the server, and using
> > F_GETLK on the range for the hanging lock on the server -- this
> > will return a struct flock, which will give you range and PID
> > information.  Do it on the Solaris box, though.
> >
> > The reason you want to do this on the Solaris box is that the
> > struct flock on FreeBSD fails to include the l_rsysid -- the
> > remote system ID.
> 
> Sorry, but I don't understand any of that.

You need to find out why it's waiting.  If it's waiting, it's
waiting for somebody.  You need to know who that somebody is.

Once you know that, you can go hit them over the head with a
large baseball bat.  8-).

I have attached the program to run on your Solaris box.  You
may have to look in /usr/include/sys/fcntl.h to see the right
name, if it complains about l_rsysid (might be l_sysid, or whatever).
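
The query at the heart of that program can be sketched as a small
function.  This is an illustration only: lock_holder() and
holder_is_child() are names invented here, and the second function
exists just to check the first against a lock whose owner is known.
It reports only the PID, since that field is portable; the remote
system ID is not.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ask who would block a write lock on [start, start+len).
 * Returns the holder's PID, 0 if nothing is in the way, -1 on error.
 */
pid_t
lock_holder( int fd, off_t start, off_t len)
{
        struct flock    fl;

        assert( fd >= 0);
        memset( &fl, 0, sizeof( fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = start;
        fl.l_len = len;

        if( fcntl( fd, F_GETLK, &fl) == -1)
                return( -1);
        return( (fl.l_type == F_UNLCK) ? 0 : fl.l_pid);
}

/*
 * Self-check: fork a child that locks byte 0 and holds it, then see
 * whether lock_holder() names that child.  Returns 1 on a match.
 */
int
holder_is_child( const char *path)
{
        int             fd, up[ 2], down[ 2], match = 0;
        char            c = 0;
        pid_t           pid;
        struct flock    fl;

        if( (fd = open( path, O_RDWR | O_CREAT, 0600)) == -1)
                return( -1);
        if( pipe( up) == -1 || pipe( down) == -1 || (pid = fork()) == -1) {
                close( fd);
                return( -1);
        }
        if( pid == 0) {
                close( up[ 0]);
                close( down[ 1]);
                memset( &fl, 0, sizeof( fl));
                fl.l_type = F_WRLCK;
                fl.l_whence = SEEK_SET;
                fl.l_len = 1;
                if( fcntl( fd, F_SETLK, &fl) == 0 &&
                    write( up[ 1], "L", 1) == 1 &&
                    read( down[ 0], &c, 1) == 1)
                        _exit( 0);      /* held the lock until released */
                _exit( 1);
        }

        close( up[ 1]);
        if( read( up[ 0], &c, 1) == 1)  /* child now holds the lock */
                match = ( lock_holder( fd, 0, 1) == pid);
        if( write( down[ 1], "x", 1) != 1)
                match = -1;
        waitpid( pid, NULL, 0);
        close( up[ 0]);
        close( down[ 0]);
        close( down[ 1]);
        close( fd);
        return( match);
}
```

Note that F_GETLK never reports locks held by the calling process
itself, which is why the check needs a second process.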


> > Actually, given this, I don't understand how FreeBSD server side
> > proxy locking can actually work at all; it would incorrectly
> > coalesce locks with local locks when the l_pid matched, which
> > would be *all* locks in the lockd, and then incorrectly release
> > them when a local process exited, or any process on any remote
> > system unlocked an overlapping range (possibly in error).
> 
> So you're suggesting that when it works, it's just lucky? But others
> have said that it works for them, and it seems to work OK between
> FreeBSD systems.

I would have to look at the locking code in FreeBSD for the NFS
case.  I wrote some NFS locking code for FreeBSD in 1995 that was
not used for the implementation.
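
The coalescing concern quoted above can be seen with ordinary local
locks, without looking at the lockd code.  The sketch below is an
illustration, not a description of FreeBSD's rpc.lockd: the function
names are invented, and it assumes the kernel merges adjacent locks
that have the same owner, which common implementations do.  One
process takes two separate locks; a second process then sees only
one.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write-lock [start, start+len) without waiting: 0 if granted, -1 if not. */
static int
lock_range( int fd, off_t start, off_t len)
{
        struct flock    fl;

        memset( &fl, 0, sizeof( fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = start;
        fl.l_len = len;
        return( fcntl( fd, F_SETLK, &fl));
}

/*
 * One process locks [0,10) and [10,20) as two requests.  A second
 * process then asks F_GETLK what blocks [0,20).  Returns 1 if it is
 * shown a single lock covering all 20 bytes, 0 if not, -1 on a
 * setup error.
 */
int
same_pid_locks_merge( const char *path)
{
        int             fd, status = 0;
        pid_t           pid;
        struct flock    fl;

        assert( path != NULL);
        if( (fd = open( path, O_RDWR | O_CREAT, 0600)) == -1)
                return( -1);
        if( lock_range( fd, 0, 10) == -1 || lock_range( fd, 10, 10) == -1 ||
            (pid = fork()) == -1) {
                close( fd);
                return( -1);
        }
        if( pid == 0) {
                /* the observer: locks are not inherited across fork() */
                memset( &fl, 0, sizeof( fl));
                fl.l_type = F_WRLCK;
                fl.l_whence = SEEK_SET;
                fl.l_start = 0;
                fl.l_len = 20;
                if( fcntl( fd, F_GETLK, &fl) == -1)
                        _exit( 2);
                _exit( (fl.l_type == F_WRLCK &&
                        fl.l_start == 0 && fl.l_len == 20) ? 0 : 1);
        }

        /* keep holding both locks until the observer has looked */
        waitpid( pid, &status, 0);
        close( fd);
        return( WIFEXITED( status) && WEXITSTATUS( status) == 0);
}
```

Once the two requests have been merged, the kernel no longer knows
they were separate, which is the worry when every remote client's
locks are filed under one l_pid.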

There are ways around the problem in userspace, but they're very
hard to make efficient or get correct.  They also make debugging
very hard, because you can't get the system ID for systems that
have outstanding locks.  8-(.


> > You are using FreeBSD as the NFS client in this case, right?  If
> > so, that's probably not an issue for you...
> 
> No.
> 
> I think that you may be trying to solve a problem I don't have.
> First - I'm not a programmer. I'm not trying to write any program
> at all, except as necessary to diagnose this problem. I'll summarize
> the situation briefly. The issue cropped up in a commercial program
> (Xinet) which was working on Solaris 2.6 client and server. I'm
> replacing the server with a FreeBSD box (RELENG_5_0) and the program
> stopped working. Xinet tech support diagnosed it as nfs locking
> problem, which I've confirmed by my simple perl program.
> 
>         Client          Server          Result
>         ======          =======         ======
>         Solaris         Solaris         Works
>         FreeBSD         Solaris         Works
>         FreeBSD         FreeBSD         Works
>         Solaris         FreeBSD         Problems
> 
> Actually, when I say "works", all I know is that it doesn't hang.
> Whether or not the lock is actually effective, I haven't tested.
> Oh, and the nonblocking flock also hangs, just like the blocking one.
> The lock call returns; the unlock call doesn't.

I'm attaching a test program to run on the server when the
lock fails.  Use the trace to get the name of the file to
enter, and the Ethereal decoded packet trace to know how to
answer the other questions.

But I think it may be as simple as you not telling us that you
have multiple IP addresses configured on one of your machines?

If so, try:

        sysctl -w net.inet.ip.check_interface=0

-- Terry
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>


/* Print a prompt and read one line from stdin, dropping the newline. */
static void
ask( const char *prompt, char *buf, size_t len)
{
        printf( "%s", prompt);
        fflush( stdout);
        if( fgets( buf, (int)len, stdin) == NULL) {
                fprintf( stderr, "Unexpected end of input\n");
                exit( 1);
        }
        buf[ strcspn( buf, "\n")] = '\0';
}

int
main( void)
{
        int             fd;
        char            fname[ 1024];
        char            scratch[ 80];
        struct flock    flock;

        memset( &flock, 0, sizeof( flock));

        ask( "Enter file name: ", fname, sizeof( fname));

        ask( "Enter lock start: ", scratch, sizeof( scratch));
        flock.l_start = (off_t)strtoll( scratch, NULL, 10);

        ask( "Enter lock length: ", scratch, sizeof( scratch));
        flock.l_len = (off_t)strtoll( scratch, NULL, 10);

        ask( "Enter Whence: ", scratch, sizeof( scratch));
        flock.l_whence = (short)atoi( scratch);

        ask( "Enter 1 for F_WRLCK, 0 for F_RDLCK: ", scratch, sizeof( scratch));
        flock.l_type = ( atoi( scratch) ? F_WRLCK : F_RDLCK);

        if( (fd = open( fname, O_RDWR)) == -1) {
                perror( "Can't open file!");
                exit( 1);
        }

        /* F_GETLK overwrites flock with the first lock that would block us */
        if( fcntl( fd, F_GETLK, &flock) == -1) {
                perror( "fcntl failure");
                exit( 1);
        }

        close( fd);

        if( flock.l_type == F_UNLCK) {
                printf( "There is no lock that would block you!\n");
        } else {
                printf( "Blocking lock of type: %s\n",
                        (flock.l_type == F_WRLCK) ? "F_WRLCK" : "F_RDLCK");
                printf( "Start: %lld\n", (long long)flock.l_start);
                printf( "Length: %lld\n", (long long)flock.l_len);
                printf( "Whence: %d (expect 0)\n", flock.l_whence);
                printf( "PID: %ld\n", (long)flock.l_pid);
#ifndef __FreeBSD__
                /* field name varies; see /usr/include/sys/fcntl.h (maybe l_sysid) */
                printf( "System: %d\n", flock.l_rsysid);
#endif
        }

        exit( 0);
}
