I am resending this because the customer has tested the fix I provided and confirmed that it resolves their issue. I would like to move forward with the fix described below, which adds a signal() call to the sig_pipe handler. I would appreciate any comments. Otherwise, I will proceed to put this into the Solaris code.

Pete

On 09/16/09 10:48, Peter Shoults wrote:
> Hi,
>
> A customer has brought forward an issue they were having with Kerberos and
> LDAP, where LDAP is being used to store the database information for
> Kerberos. The issue is that if the LDAP server is restarted for any
> reason, then Kerberos does not automatically resync with the LDAP
> server when the LDAP server is back up and running. Specifically, one
> can run and log in to kadmin, but any commands that are run will fail
> with the error:
>
> "Communication failure with server while retrieving list."
>
> It turns out that if the user exits from kadmin and logs back in a second
> time, then the commands work fine.
>
> I have determined that the cause of this problem is that when the LDAP
> server is restarted, all the connections we have on port 636 to the LDAP
> server go into a CLOSE_WAIT/FIN_WAIT_2 state. When we log in to kadmin,
> we attempt to contact the LDAP server on these connections, and we
> receive SIGPIPE in response to our writes. Here is a snippet from truss:
>
> 3200/1: 57.2401 write(14, 0x0010B810, 23) Err#32 EPIPE
> 3200/1: 150301\012941A 60F Y P87A7BE9318B6 c8C |0F v
> 3200/1: 57.2404 Received signal #13, SIGPIPE [caught]
>
> This is fine - the sig_pipe handler is invoked and we do print out the
> syslog message. However, we never reset the signal disposition for
> SIGPIPE back to sig_pipe. The kadmind process immediately proceeds to try
> the next connection to the LDAP server, and again gets SIGPIPE. This time
> though, the default handler is invoked, which terminates kadmind. At
> this point, SMF realizes kadmind has died and restarts it, which
> re-establishes all our connections to the LDAP server, and that explains
> why a subsequent login to kadmin will work.
>
> I have two questions about this. The first is why we have a handler for
> SIGPIPE in the kadmin code, unlike the krb5kdc code, which sets the SIGPIPE
> disposition to SIG_IGN. This handler in the kadmin code has not
> changed in a long, long time. I tested setting SIGPIPE to SIG_IGN and
> this does allow a user to enter commands into kadmin after the LDAP server
> restarts and run them without issue.
>
> Assuming we have the SIGPIPE handler specifically to output the syslog
> message, I propose that the handler reset the signal disposition to
> sig_pipe. I have also tested this fix and verified that it resolves the
> problem and allows the user to enter kadmin commands after the LDAP
> server restarts. Here is my change:
>
> The file modified is ovsec_kadmd.c:
>
> void
> sig_pipe(int unused)
> {
> +     signal(SIGPIPE, sig_pipe);
>       krb5_klog_syslog(LOG_NOTICE, gettext("Warning: Received a SIGPIPE; "
>           "probably a client aborted. Continuing."));
> }
>
> Pete
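
An alternative to re-arming inside the handler, offered only as a sketch and not as the change tested above: install the handler once with sigaction(). Without SA_RESETHAND the disposition stays in place after each delivery, so a second SIGPIPE still reaches the handler. The names install_sigpipe_handler, demo_two_broken_writes, and sigpipe_count are hypothetical, and the handler here only counts signals instead of calling krb5_klog_syslog() so that the example stands alone.

```c
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical counter; the real handler logs through krb5_klog_syslog(). */
static volatile sig_atomic_t sigpipe_count = 0;

static void sig_pipe(int unused)
{
    (void)unused;
    sigpipe_count++;
}

/*
 * Install sig_pipe with sigaction(). Because SA_RESETHAND is not set,
 * the disposition is not reset to SIG_DFL when the handler runs.
 * Returns 0 on success, -1 on failure (as sigaction() does).
 */
static int install_sigpipe_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sig_pipe;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGPIPE, &sa, NULL);
}

/*
 * Write twice to a pipe whose read end is closed, which stands in for
 * the dead LDAP connections. Returns the number of SIGPIPEs handled so
 * far, or -1 if a write did not fail with EPIPE.
 */
static int demo_two_broken_writes(void)
{
    int fds[2];
    int i;
    int rc = pipe(fds);

    assert(rc == 0);
    close(fds[0]);  /* no reader left: writes now raise SIGPIPE */
    for (i = 0; i < 2; i++) {
        if (write(fds[1], "x", 1) != -1 || errno != EPIPE) {
            close(fds[1]);
            return -1;
        }
    }
    close(fds[1]);
    return (int)sigpipe_count;
}
```

Calling install_sigpipe_handler() once at startup and then demo_two_broken_writes() should show both writes failing with EPIPE while the process keeps running, which is the behavior the signal() re-arm in the patch is meant to achieve.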