RHEL support pointed me to a document suggesting this may be an implementation issue:
http://kbase.redhat.com/faq/docs/DOC-5935

"DNS is not a reliable way to get name resolution for the cluster. All
cluster nodes must be defined in /etc/hosts with the name that matches
cluster.conf and uname -n."

But we have used a local DNS service on all of our hosts for years and
leave /etc/hosts alone, with only the localhost entry in it. Our servers
have multiple bonded NICs, so I put the cluster "hosts" in their own
private domain / IP range, and each clusternode entry in cluster.conf is
added by its short name only: acropolis, cerberus, rycon, and solaria (a
sketch of those entries, and of the /etc/hosts layout DOC-5935 appears to
call for, is at the end of this message). I let DNS (and reverse DNS)
resolve those names, i.e.:

/etc/resolv.conf:
search blade ccc.cluster bidmc.harvard.edu bidn.caregroup.org
nameserver 127.0.0.1

$ host acropolis
acropolis.blade has address 192.168.2.1
$ host cerberus
cerberus.blade has address 192.168.2.4
$ host rycon
rycon.blade has address 192.168.2.11
$ host solaria
solaria.blade has address 192.168.2.12

[r...@acropolis ~]$ netstat -a | grep 6809
udp        0      0 acropolis.blade:6809        *:*
udp        0      0 192.168.255.255:6809        *:*

[r...@cerberus ~]# netstat -a | grep 6809
udp        0      0 cerberus.blade:6809         *:*
udp        0      0 192.168.255.255:6809        *:*

[r...@solaria ~]# netstat -a | grep 6809
udp        0      0 solaria.blade:6809          *:*
udp        0      0 192.168.255.255:6809        *:*

... even though each of those servers' $(uname -n) has .bidmc.harvard.edu
(for the corporate LAN-facing NICs) appended to it.

Is this REALLY a cause for concern? If so, could it introduce a failure
(if not at join time) during some later event?

Any feedback is welcome!


On Tue, 2009-08-11 at 10:55 -0400, Robert Hurst wrote:
> Simple 4-node cluster; 2 nodes have had a GFS shared home directory
> mounted for over a month. Today, I wanted to mount /home on a 3rd
> node, so:
>
> # service fenced start    [failed]
>
> Weird. Checking /var/log/messages shows:
>
> Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10
> (built Jan 22 2009 18:39:16) installed
> Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22
> 2009 18:39:32) installed
> Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster
> "lock_dlm", "ccc_cluster47:home"
> Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18)
> installed
> Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found;
> check fenced
> Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm,
> table = ccc_cluster47:home, hostdata =
>
> # cman_tool services
> Service          Name                             GID LID State   Code
> Fence Domain:    "default"                          0   2 join    S-2,2,1
> []
>
> So, a fenced process is now hung:
>
> root  28302  0.0  0.0  3668  192 ?  Ss  10:19  0:00  fenced -t 120 -w
>
> Q: Any idea how to "recover" from this state, without rebooting?
>
> The other two servers are unaffected by this (thankfully) and show
> normal operations:
>
> $ cman_tool services
> Service          Name                             GID LID State   Code
> Fence Domain:    "default"                          2   2 run     -
> [1 12]
>
> DLM Lock Space:  "home"                             5   5 run     -
> [1 12]
>
> GFS Mount Group: "home"                             6   6 run     -
> [1 12]
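
P.S. For reference, a minimal sketch of the clusternode entries I
described above. Only the node names come from our actual cluster.conf;
the nodeid and votes values are my guess (nodeids inferred from the
"[1 12]" membership shown above and the last octet of each address), so
treat this as illustrative rather than a copy of our configuration:

    <clusternodes>
        <clusternode name="acropolis" nodeid="1"  votes="1"/>
        <clusternode name="cerberus"  nodeid="4"  votes="1"/>
        <clusternode name="rycon"     nodeid="11" votes="1"/>
        <clusternode name="solaria"   nodeid="12" votes="1"/>
    </clusternodes>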
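
For comparison, this is roughly what I understand DOC-5935 to be asking
for instead: static /etc/hosts entries on every node whose names match
both cluster.conf and uname -n. The addresses are taken from the host
output above; the exact name spellings are my assumption, and whether
they should be the short names or the .bidmc.harvard.edu names returned
by uname -n is precisely what I am unsure about:

    # /etc/hosts (per DOC-5935; today we keep only the localhost line)
    127.0.0.1      localhost.localdomain   localhost
    192.168.2.1    acropolis.blade         acropolis
    192.168.2.4    cerberus.blade          cerberus
    192.168.2.11   rycon.blade             rycon
    192.168.2.12   solaria.blade           solaria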
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
