We were doing an expiration run this past weekend, and had a bit of weirdness/failure in replication. The job started at 3:30am, and for about 40 minutes, incremental update requests were refused due to the load:

Oct 15 03:31:32 halfy kadmind[2935]: Request: iprop_get_updates_1, UPDATE_BUSY; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10
[...]
Oct 15 04:10:00 halfy kadmind[2935]: Request: iprop_get_updates_1, UPDATE_BUSY; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10

Finally, the slave decided it wanted to do a full sync:

Oct 15 04:15:00 halfy kadmind[2935]: Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10
Oct 15 04:15:00 halfy kadmind[2935]: Request: iprop_full_resync_1, spawned resync process 23469, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10

but then the child process *also* spawned a child process:

Oct 15 04:17:34 halfy kadmind[23469]: Request: iprop_full_resync_1, spawned resync process 23901, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10

which spawned a child process:

Oct 15 04:25:17 halfy kadmind[23901]: Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10
Oct 15 04:25:17 halfy kadmind[23901]: Request: iprop_full_resync_1, spawned resync process 24637, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10

which *finally* did the transfer:

Oct 15 04:27:51 halfy kadmind[24637]: Request: iprop_get_updates_1, UPDATE_FULL_RESYNC_NEEDED; Incoming SerialNo=76369; Outgoing SerialNo=N/A, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10

There was a similar chain of processes for the other slave server, repeating a number of times, so in the end there were something like 20 kadmind processes running, and it looks like they were all serving requests:

Oct 15 05:39:54 halfy kadmind[24637]: Request: iprop_get_updates_1, UPDATE_OK; Incoming SerialNo=89261; Outgoing SerialNo=89263, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.10
Oct 15 05:40:30 halfy kadmind[24293]: Request: iprop_get_updates_1, UPDATE_OK; Incoming SerialNo=89261; Outgoing SerialNo=89263, success, client=kiprop/[email protected], service=kiprop/[email protected], addr=134.71.247.11
Oct 15 05:54:31 halfy kadmind[25049]: Request: kadm5_chpass_principal, [email protected], success, [email protected], service=kadmin/[email protected], addr=134.71.247.23
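My guess at what's happening, based on the PIDs above: each full-resync worker is forked off the serving process and inherits its listening sockets, and if it doesn't exit once the dump/propagation is done, it drops back into the request loop, answers the slave's next poll itself, and forks its own worker. That's just my reading of the behavior, not the actual krb5 code; here's a minimal C sketch of the pattern (do_full_resync() and the simulated request loop are stand-ins, not kadmind's real internals):

/*
 * Illustrative sketch only (not the krb5 sources): how a forked resync
 * worker that never _exit()s falls back into the request loop, answers
 * the slave's next poll itself, and forks a child of its own -- producing
 * a chain of processes like 2935 -> 23469 -> 23901 -> 24637 above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void do_full_resync(void)
{
    /* Stand-in for dumping the KDB and propagating it to the slave. */
    printf("pid %ld: performing full resync\n", (long)getpid());
}

int main(void)
{
    /* Simulate three successive full-resync requests arriving on a
     * listening socket that every process in the chain has inherited. */
    for (int request = 0; request < 3; request++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid > 0) {
            /* Parent: reap the worker; a real server would return to
             * its accept loop here. */
            waitpid(pid, NULL, 0);
            break;
        }
        /* Child: do the resync.  The bug is what's missing next --
         * without _exit(0) the child loops around like its parent,
         * picks up the next request, and spawns its own worker. */
        do_full_resync();
        /* _exit(0); */
    }
    return 0;
}

Restore the commented-out _exit(0) and only one short-lived worker ever exists at a time, which is what I'd expect a healthy kadmind to look like.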
I think this failure had something to do with running out of file descriptors:

Oct 15 03:30:02 halfy kadmind[2935]: dropping tcp fd 32 from 134.71.247.23.47346
Oct 15 03:30:02 halfy kadmind[2935]: dropping tcp fd 32 from 134.71.247.23.47346
Oct 15 04:20:09 halfy kadmind[23907]: too many connections
Oct 15 04:20:09 halfy kadmind[23907]: too many connections
Oct 15 04:20:09 halfy kadmind[23907]: dropping tcp fd 34 from 134.71.247.23.46232
Oct 15 04:20:09 halfy kadmind[23907]: dropping tcp fd 34 from 134.71.247.23.46232
Oct 15 04:22:43 halfy kadmind[24274]: too many connections
Oct 15 04:22:43 halfy kadmind[24274]: too many connections
Oct 15 04:22:43 halfy kadmind[24274]: dropping tcp fd 17 from 134.71.247.23.46777
Oct 15 04:22:43 halfy kadmind[24274]: dropping tcp fd 17 from 134.71.247.23.46777
Oct 15 04:25:17 halfy kadmind[23907]: too many connections
Oct 15 04:25:17 halfy kadmind[23907]: too many connections
Oct 15 04:25:17 halfy kadmind[23907]: dropping tcp fd 17 from 134.71.247.23.46777
Oct 15 04:25:17 halfy kadmind[23907]: dropping tcp fd 17 from 134.71.247.23.46777
Oct 15 04:27:51 halfy kadmind[24640]: too many connections
Oct 15 04:27:51 halfy kadmind[24640]: too many connections
Oct 15 04:27:51 halfy kadmind[24640]: dropping tcp fd 14 from 134.71.247.23.41273
Oct 15 04:27:51 halfy kadmind[24640]: dropping tcp fd 14 from 134.71.247.23.41273
Oct 15 04:29:40 halfy kadmind[24274]: too many connections
Oct 15 04:29:40 halfy kadmind[24274]: too many connections
Oct 15 04:29:40 halfy kadmind[24274]: dropping tcp fd 14 from 134.71.247.23.41273
Oct 15 04:29:40 halfy kadmind[24274]: dropping tcp fd 14 from 134.71.247.23.41273
Oct 15 04:30:25 halfy kadmind[24268]: too many connections
Oct 15 04:30:25 halfy kadmind[24268]: too many connections
Oct 15 04:30:25 halfy kadmind[24268]: dropping tcp fd 34 from 134.71.247.23.46232
Oct 15 04:30:25 halfy kadmind[24268]: dropping tcp fd 34 from
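I haven't dug into where kadmind's cap actually lives (there may be a hard-coded connection limit in its network code, separate from the OS limit), but it seemed worth ruling out the per-process descriptor limit while I was at it. This little check is plain POSIX, nothing kadmind-specific:

/*
 * Environment check only -- plain POSIX calls, nothing kadmind-specific.
 * Prints the file-descriptor limits a daemon started from this shell
 * would inherit.  Note that a select()-based server can never track
 * descriptors >= FD_SETSIZE in an fd_set, no matter how high
 * RLIMIT_NOFILE is raised.
 */
#include <stdio.h>
#include <sys/select.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("RLIMIT_NOFILE: soft=%llu hard=%llu (FD_SETSIZE=%d)\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max, FD_SETSIZE);

    /* Raise the soft limit to the hard limit, the same effect an init
     * script gets from "ulimit -n <hard>" before starting the daemon. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");

    return 0;
}

Raising the limit would only be a band-aid, of course; the excess connections themselves were the thing to fix.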
I ended up having to kill all the kadmind processes and restart the server. I don't think there is ever supposed to be more than one process working at a time, and I hope having all of them mucking around simultaneously didn't screw up my database :(. I've tracked down the root cause of the excessive connections to the kadmin server, but the process spawning seems like a separate bug? We're running 1.9.1.

We did another batch job the following day, which had a similar pattern of refusing incrementals and then requesting a full resync, but it worked fine. Given that I'd just restarted everything, it wasn't having the too-many-connections issue, which is why I think the first failure is related to that...

--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | [email protected]
California State Polytechnic University | Pomona CA 91768
