I have attempted to debug the kernel panic that I reported on this list last 
week, which has been reported by several others as well.  The panic happens 
when DRBD is used in clusters based on corosync (either RHCS or Pacemaker), but 
only when those clusters are configured with multiple heartbeats (i.e., with 
"altname" specifications for the cluster nodes).  The panic appears to be 
caused by two defects, one in the distributed lock manager (DLM, used by 
corosync) and one in the SCTP network protocol (which is used in clusters with 
multiple heartbeats).  DRBD code triggers the panic but appears to be blameless 
for it.

Disclaimer:  I am not a Linux kernel expert; all of my kernel debugging 
expertise is on a different flavor of Unix.  My assumptions or conclusions may 
be incorrect; I do not guarantee 100% accuracy of this analysis.  Caveat lector.

Environment:  As will be clear from the analysis below, this defect can 
manifest in many ways.  I debugged a particular manifestation that occurred 
with DRBD 8.4.0 running on kernel 2.6.32-71.29.1.el6.x86_64 (i.e., RHEL/CentOS 
6.0).  The scenario I debugged was a two-node cluster in which node A is shut 
down and started back up.  Node B panics as soon as node A comes back up.  
(See my previous mail for the defect signature.)

When the cluster starts up, it creates a DLM "lockspace".  This causes the DLM 
code to create a socket for communication with the other nodes.  Since we're 
configured for multiple heartbeats, it's an SCTP socket.  DLM also creates a 
bunch of new kernel threads, among which is the dlm_recv thread, which listens 
for traffic on that socket.  (Actually I see two of them, one per CPU.)  You 
can see this in a "ps" listing.

An important thing to note here is that all kernel threads are part of the same 
pseudo-process, and as such, they all share the same set of file descriptors.  
However, kernel threads do not normally (ever?) use file descriptors; they tend 
to work with file structures directly.  The SCTP socket created above, for 
example, has the appropriate in-kernel socket structure, file structure, and 
inode structure, but it does not have a file descriptor.  That's as it should 
be.

When node A starts back up, the SCTP protocol notices this (as it's supposed 
to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART notification to the SCTP 
socket, telling the socket owner (the dlm_recv thread) that the other node has 
restarted.  DLM responds by telling SCTP to create a clone of the master 
socket, for use in communicating with the newly restarted node.  (This is an 
SCTP_SOCKOPT_PEELOFF request.)  And this is where things go wrong: the 
SCTP_SOCKOPT_PEELOFF request is designed to be called from user space, not from 
a kernel thread, and so it does allocate a file descriptor for the new socket.  
Since DLM is calling it from a kernel thread, the kernel thread now has an open 
file descriptor (#0) to that socket.  And since kernel threads share the same 
file descriptor table, every kernel thread on the system has this open 
descriptor.  So defect #1 is that DLM is calling an SCTP user-space interface 
from a kernel thread, which results in pollution of the kernel thread file 
descriptor table.
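
Why descriptor #0 in particular?  A new descriptor is normally placed in the 
lowest unused slot, and a table that has never held a descriptor has slot 0 
free.  The following is a minimal user-space sketch of that allocation rule, 
not the DLM or SCTP code.  It assumes the in-kernel allocation made during 
peel-off follows the same lowest-free rule that POSIX requires for user 
space.  The helper name is hypothetical, and an AF_UNIX socket stands in for 
the SCTP socket.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: free descriptor slot 0, create a socket, and return
 * the descriptor number that socket received.  The caller's original
 * descriptor 0 (if there was one) is restored before returning.
 */
int socket_fd_when_slot_zero_is_free(void)
{
    int saved = dup(STDIN_FILENO);      /* -1 if fd 0 was not open */
    close(STDIN_FILENO);                /* slot 0 is now unused */

    /* The new socket takes the lowest unused descriptor number. */
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd >= 0)
        close(fd);                      /* the demo socket is no longer needed */
    if (saved >= 0) {
        int restored = dup2(saved, STDIN_FILENO);
        assert(restored == STDIN_FILENO);
        close(saved);
    }
    return fd;
}
```

In a single-threaded program the helper returns 0: the socket lands in the 
slot where a process expects its standard input.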

Meanwhile, DRBD has its own in-kernel code, running in a different kernel 
thread.  And it detects (I didn't bother to track down how) that its peer is 
back online.  DRBD allows the user to configure handlers for events like that: 
user space programs that should be called when such an event occurs.  So when 
DRBD notices that its peer is back, its kernel thread uses 
call_usermodehelper() to start a user-space instance of drbdadm to invoke any 
appropriate handlers.  
This is the invocation of drbdadm that we see in the panic report.  (drbdadm 
gets invoked this way in response to a number of other possible events, as 
well, so this panic can manifest itself in other ways.)

The key thing about this instance of drbdadm is that it was invoked by a kernel 
thread.  Therefore it shouldn't have any open file descriptors - but in this 
case, it does: it inherits fd 0 pointing to the SCTP socket.  One of the first 
things that drbdadm does, when starting up, is call isatty() on its standard 
input to find out how it should format its output.  If it were called from user 
space, that would 
correctly check whether standard input was interactive.  If it were called 
correctly from a kernel thread, there would be no stdin and it would correctly 
return an error.  But what actually happens is that it calls isatty on the SCTP 
socket that is (incorrectly) in file descriptor 0.

isatty() works by issuing a terminal ioctl on the descriptor.  When ioctl is 
called on a socket, the sock_ioctl() function dereferences the socket data 
structure pointer (sk).  Defect #2 is that the offending socket in this case 
has a null sk pointer.  (I did not track down why, but presumably it's a 
problem with the SCTP peel-off code.)  So when sock_ioctl() dereferences the 
pointer, the kernel panics.
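
For comparison, this is what the same probe does on a healthy socket.  On 
Linux with glibc, isatty() is built on tcgetattr(), which in turn issues a 
TCGETS ioctl on the descriptor.  The sketch below is user-space only, with a 
hypothetical helper name and an AF_UNIX socket standing in for the SCTP one.  
It shows both calls failing cleanly with "not a terminal" rather than 
crashing, which is the behavior drbdadm would have seen if the socket's sk 
pointer had been valid.

```c
#include <sys/socket.h>
#include <termios.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if both terminal probes report "not a
 * terminal" for a freshly created socket, 0 otherwise, -1 on setup failure.
 */
int socket_fails_tty_probes(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct termios t;
    int tty = isatty(fd);               /* 0 means "not a terminal" */
    int attr = tcgetattr(fd, &t);       /* -1 means the terminal query failed */

    close(fd);
    return (tty == 0 && attr == -1) ? 1 : 0;
}
```

So the ioctl itself is harmless.  By the analysis above, the panic needs both 
defects: the stray descriptor and the null sk pointer behind it.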

So, to recap:  this panic occurs because (a) the drbdadm process is erroneously 
given an SCTP socket as its standard input, and (b) that socket's data pointer 
is null, so it panics when drbdadm (reasonably) makes an ioctl call on its 
standard input.

If you need a workaround for this panic, the best I can offer is to remove the 
"altname" specifications from the cluster configuration and set <totem 
rrp_mode="none"> and <dlm protocol="tcp">, so that corosync uses TCP sockets 
instead of SCTP sockets.

Regards,
Steven Roth
Hewlett-Packard Company

P.S.  Some readers of this mailing list may be frustrated by the lack of useful 
response from DRBD engineers.  I'd like to point out that the use of multiple 
heartbeats is a critical part of this defect scenario that was not mentioned in 
any of the panic reports (including mine).  I don't know if they tried, but 
DRBD engineers were not given sufficient information to reproduce the problem.

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
