On Fri, Apr 20, 2012 at 10:30 AM, Andrew Deason <[email protected]> wrote:

> On Thu, 19 Apr 2012 18:55:08 -0700
> Ken Elkabany <[email protected]> wrote:
>
> > We have 2 OpenAFS servers running 1.4.14. We have many clients that we
> > just switched over to 1.6.1pre1. Starting earlier today, we started
> > getting NULL pointer dereferences, which has been completely hosing
> > the clients. The client machines hang on any call that deals with AFS,
> > whether it's "ls /", "ls /afs", "klist", etc...
>
> 'klist' shouldn't touch AFS...
>

I agree. Perhaps I'm remembering incorrectly, but the whole filesystem seemed
to be in a very bad state (hence the hang on "ls /").


>
> > A "vos changeaddr" was done earlier today, whereby a large collection
> > (4000) of volumes were mistakenly assigned to another server. These
> > were corrected with "vos syncvldb" followed by "vos syncserv". I
> > mention it here, as it's the only thing we've done to the AFS cluster
> > today.
>
> I hope 'vos changeaddr' is not something you run very often. You should
> almost never run that command.
>
>
Not at all. We have a fileserver with two IPs mapped to one of its
interfaces: one reachable via the LAN, one via the WAN. We wanted all
access to go through the local IP, so we used changeaddr. Unfortunately,
we pointed all the volumes at a different, incorrect server.

> I'm also not sure if just running syncvldb/syncserv will entirely fix
> that; changeaddr has screwed up server entries pretty bad in the past. I
> wouldn't say you're out of the woods until you're still fine after a
> fileserver restart.
>
It looks like it didn't. Due to some combination of changeaddr, syncvldb,
and syncserv, some volumes ended up with two identical RO entries. After
removing the RO entries for the volumes that exhibited this issue, the
segfaulting went away. This was a lengthy operation, as we have 4,000+
volumes, each with read-only copies.
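For anyone hitting the same thing, the cleanup went something along these
lines (volume and server names below are examples, not our real ones; note
that "vos remsite" deletes only the RO site entry from the VLDB and never
touches volume data on the fileserver):

```shell
# Sketch of the VLDB cleanup described above. All names are hypothetical.
set -eu

if ! command -v vos >/dev/null 2>&1; then
    echo "vos not installed; dry run only"
    exit 0
fi

VOL=example.volume          # hypothetical volume name
SERVER=fs1.example.com      # hypothetical fileserver
PART=a                      # partition letter, i.e. /vicepa

# Inspect the entry first: a duplicated RO site shows up twice here.
vos listvldb -name "$VOL"

# Drop the bogus RO site entry, then re-sync the VLDB and re-release.
vos remsite -server "$SERVER" -partition "$PART" -id "$VOL"
vos syncvldb -server "$SERVER"
vos release -id "$VOL"
```

With 4,000+ volumes this obviously wants to be wrapped in a loop over
"vos listvldb" output rather than run by hand.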


> > Here's what we found in the syslog:
> >
> > Apr 20 01:30:43 SERVER kernel: [12861236.027818] BUG: unable to handle
> > kernel NULL pointer dereference at 0000000000000028
> > Apr 20 01:30:43 SERVER kernel: [12861236.027836] IP: [<ffffffffa0048087>]
> > afs_Conn+0x1e7/0x260 [openafs]
>
> 1.6.1pre1 is not a great version to be running, but I can't think of
> something that's been fixed since then that would address this. If
> someone else has some idea, feel free to say otherwise.
>
We did an early upgrade to the daily build of Ubuntu Precise, which had
1.6.1pre1. We just saw the update to 1.6.1-1 from two days ago (thanks
Sergio), so we'll be upgrading asap.


> That offset suggests dereferencing sa_flags for a NULL srvAddr. The only
> place I think that can happen is:
>
>    /* First is always lowest rank, if it's up */
>    if ((tv->status[0] == not_busy) && tv->serverHost[0]
>        && !(tv->serverHost[0]->addr->sa_flags & SRVR_ISDOWN) &&
>        !(((areq->idleError > 0) || (areq->tokenError > 0))
>          && (areq->skipserver[0] == 1)))
>
> so it suggests that we have a vol struct with a server that has 0
> srvAddrs attached. I think maybe this is possible if we had another
> server struct 'steal' the srvAddr away, so we removed the srvAddr and
> we're left with none.
>
> One guess at what happened:
>
>  - something tries to look up a volume A, and knows that it's on a
>   server with IP address X
>  - 'vos changeaddr' created a new non-mh server entry for IP X
>  - we look up some other volume B, see it's on IP X, but we make a new
>   server struct for the non-mh server, and steal X away from the other
>   mh server struct
>  - we need to look up volume A again, and serverHost[0] is pointing to a
>   server that used to have the srvAddr for IP X, but since it was
>   reassigned, the first srvAddr is now NULL.
>
> I'm not exactly sure what we'd want to happen in that situation. The
> quick fix is to just check for the NULL srvAddr, but I'm not immediately
> sure if that would cause us to keep trying to use a volume struct with
> no reachable servers until the vol entry expires.
>
> I (or someone) would need to experiment a bit to verify that. However,
> if something like the above is the case, this should never happen unless
> you make server identity/numbering mistakes like that. If someone else
> wants to look, I would guess this is reproducible by accessing a vol,
> changing the server uuid, accessing a different vol on the same server
> and then accessing the first vol again. Or something like that.
>
>
Thank you for taking the time to think about this. Once again, after
deleting the duplicate RO entries the problem disappeared. If you need any
help in reproducing this issue, do let me know. However, since it happened
on our live production systems, we aren't keen on making it happen again.


>  --
> Andrew Deason
> [email protected]
>
> _______________________________________________
> OpenAFS-info mailing list
> [email protected]
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
