I think I have found the cause (it does not look related to us), but I am
not sure.
When switching NCPs, nwam and network/location are refreshed. So, there
are two svc.startd that are wedged on a door() call, with the pstack that
has been emailed around a few times. I checked the pstack of the nscd
daemon and there seems to be a deadlock (two threads have the same pstack,
shown below):
bash-3.2# pstack `pgrep nscd`
[...]
----------------- lwp# 63 / thread# 63 --------------------
c52e2489 lwp_park (0, 0, 0)
c52d8ea3 mutex_lock_impl (81570a4, 0, 0, c52d8fd7, 6) + 163
c52d900a mutex_lock (81570a4, 6, c0f36c98, c501a0aa) + 3f
c501a128 rpc_fd_lock (8157008, 6, 0, c5013569) + 98
c5013587 clnt_dg_call (80cee08, 3, c503af6c, c0f36de0, c503b034, c0f36df0) + 2f
c503a8c4 domatch (80c5dd0, c51d70d8, c0f373f0, 9, 80dd270) + 78
c503a1e3 __yp_match_cflookup (80c5dd0, c51d70d8, c0f373f0, 9, c0f36ef8, c0f36efc) + 147
c51d65b8 _nss_nis_ypmatch (80c5dd0, c51d70d8, c0f373f0, c0f36ef8, c0f36efc, 0) + 38
c51d666d _nss_nis_lookup (809de28, c0f37070, 1, c51d70d8, c0f373f0, 0) + 35
c51d3a29 getbyname (809de28, c0f37070, 0, c52d7d9d) + 109
0806a416 nss_search (c507d268, c4ff99a4, 4, c0f37070) + 762
c4ffe00b _switch_getipnodebyname_r (c0f373f0, 814b414, 814b428, 2120, 1a, 3) + 6b
c4ffcf6b _get_hostserv_inetnetdir_byname (80dd590, c0f371a0, c0f37178, c4ff7ccd) + bc3
c4ff7d84 netdir_getbyname (80dd590, c0f371f8, c0f371f4, c5021fae) + c4
c5022161 _getclnthandle_timed (c0f373f0) + 1c1
c5022b8b __rpcb_findaddr_timed (186a4, 2, 80dd540, c0f373f0, c0f372fc, 0) + 2df
c5015dde clnt_tp_create_timed (c0f373f0, 186a4, 2, 80dd540, 0, c507c000) + 3e
c50156c7 clnt_create_timed (c0f373f0, 186a4, 2, c5068b0c, 0, 0) + 18f
c5015529 clnt_create (c0f373f0, 186a4, 2, c5068b0c) + 29
c5037174 __yp_all_cflookup (80c5dd0, c51d7038, c0f37928, 0) + 1f8
c51d6a83 _nss_nis_do_all (809d1a8, c0f37ab0, c0f37d50, c526c72d) + 4f
c51d3472 getbymember (809d1a8, c0f37ab0, 0, 806a483) + 7a
0806a416 nss_search (0, 80695ec, 6, c0f37ab0) + 762
0806af9c nss_psearch (c0f37c90, 80000, c0f37c78, 8130748) + f0
0805d6f3 lookup_int (c0fbbca8, 0, 0, c0f37cb0) + efb
0805d8b4 nsc_lookup (c0fbbca8, 0, 10, c0f37cb0) + 18
0806faad lookup (c0fbbd38, c8, 0, 1) + 13d
08070049 switcher (deadbeed, c0fbbd38, c8, 0, 0, 806fe3c) + 20d
c52e7ff0 __door_return () + 60
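
In case it helps anyone confirm how many nscd threads are sitting in this same stack, here is a rough sketch that groups the LWPs in saved pstack output by their sequence of function names. The function name group_by_stack and the two regular expressions are my own assumptions based on the output format shown above, not part of any existing tool; wrapped continuation lines are simply ignored.

```python
import re
from collections import defaultdict

# Header lines look like:
#   ----------------- lwp# 63 / thread# 63 --------------------
HEADER = re.compile(r"^-+\s+lwp#\s+(\d+)\s+/\s+thread#\s+(\d+)\s+-+$")

# Frame lines look like:
#   c52e2489 lwp_park (0, 0, 0)
# Wrapped argument lines do not match (hex is followed by ',' or ')').
FRAME = re.compile(r"^[0-9a-fA-F]{8,16}\s+(\S+)\s*\(")


def group_by_stack(pstack_text):
    """Map each distinct tuple of function names to the LWP ids sharing it."""
    stacks = defaultdict(list)
    lwp = None
    frames = []
    for raw in pstack_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # A new thread starts: record the previous one, if any.
            if lwp is not None:
                stacks[tuple(frames)].append(lwp)
            lwp = int(header.group(1))
            frames = []
            continue
        frame = FRAME.match(line)
        if frame and lwp is not None:
            frames.append(frame.group(1))
    # Record the last thread in the dump.
    if lwp is not None:
        stacks[tuple(frames)].append(lwp)
    return dict(stacks)


def shared_stacks(pstack_text):
    """Return only the stacks that more than one LWP is sitting in."""
    groups = group_by_stack(pstack_text)
    return {stack: lwps for stack, lwps in groups.items() if len(lwps) > 1}
```

To use it, save the output of the pstack command above to a file, read the file into a string, and pass it to shared_stacks(); any stack whose innermost frames are lwp_park and mutex_lock with more than one LWP listed is a candidate for the deadlock described here.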
A change was made to nscd in snv_127 (our gate is synced with this) as
the fix for
6863709 nscd dumps core after receiving SIGHUP
I haven't done a thorough read of the evaluation, but I wonder if this
is the root cause.
Anurag