Hi José, you are my hero. Swapping the kernel fixed my problem. Thanks a lot
Sebastian

"José Costa" <[EMAIL PROTECTED]> wrote:
> That kernel is buggy. Use the Kernel of the Day from the SP1 branch.
>
> ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH
>
> -------------------------------------------------------------
>
> Hi list,
>
> I have been struggling for weeks to get a linux-ha cluster running that
> manages some ocfs2 partitions. I think I have isolated the problem to ocfs2
> itself.
>
> I tried a disk-based heartbeat, without linux-ha running, to make sure
> that ocfs2 works as expected, but unfortunately it does not.
>
> When I mount the ocfs2 partition on the first host, everything is fine. I can
> run ls /mnt (where the ocfs2 partition is mounted) and see the directory
> listing.
>
> I can mount the partition on the second host as well. The mount command shows
> the partition as mounted, and mounted.ocfs2 shows it mounted on both hosts,
> but ls /mnt hangs forever. The ls process cannot be killed with kill -9, and
> umount is impossible too.
>
> Here is my cluster.conf file:
>
> node:
>         ip_port = 7777
>         ip_address = 192.168.102.31
>         number = 0
>         name = ppsnfs101
>         cluster = ocfs2
>
> node:
>         ip_port = 7777
>         ip_address = 192.168.102.32
>         number = 1
>         name = ppsnfs102
>         cluster = ocfs2
>
> cluster:
>         node_count = 2
>         name = ocfs2
>
>
> Both hosts are reachable via the network. The ocfs2 partitions are on a SAN,
> mounted by both hosts; I also tried iSCSI, with the same result.
>
>
> I have these ocfs2 RPMs installed on SLES 10, running on x86_64, with
> kernel 2.6.16.27-0.6-smp:
>
> ocfs2-tools-1.2.2-0.2
> ocfs2console-1.2.2-0.2
>
> While all this happens, I see the following messages in the logs.
> This is the log from the first host mounting the device:
>
> Feb 19 10:53:25 ppsnfs102 kernel: ocfs2: Mounting device (8,1) on (node 1, slot 1)
> Feb 19 10:54:45 ppsnfs102 kernel: o2net: connected to node ppsnfs101 (num 0) at 192.168.102.31:7777
> Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Node 0 joins domain CAD397436504401B86AA79A8BCAE88D4
> Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Nodes in domain ("CAD397436504401B86AA79A8BCAE88D4"): 0 1
> Feb 19 10:54:55 ppsnfs102 kernel: o2net: no longer connected to node ppsnfs101 (num 0) at 192.168.102.31:7777
>
>
> These are the messages from the second host mounting the device:
>
> Feb 19 10:50:59 ppsnfs101 zmd: Daemon (WARN): Not starting remote web server
> Feb 19 10:51:02 ppsnfs101 kernel: eth1: no IPv6 routers present
> Feb 19 10:51:03 ppsnfs101 kernel: eth0: no IPv6 routers present
> Feb 19 10:54:45 ppsnfs101 kernel: o2net: accepted connection from node ppsnfs102 (num 1) at 192.168.102.32:7777
> Feb 19 10:54:49 ppsnfs101 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> Feb 19 10:54:49 ppsnfs101 kernel: ocfs2_dlm: Nodes in domain ("CAD397436504401B86AA79A8BCAE88D4"): 0 1
> Feb 19 10:54:49 ppsnfs101 kernel: kjournald starting. Commit interval 5 seconds
> Feb 19 10:54:49 ppsnfs101 kernel: ocfs2: Mounting device (8,1) on (node 0, slot 0)
> Feb 19 10:54:55 ppsnfs101 kernel: o2net: connection to node ppsnfs102 (num 1) at 192.168.102.32:7777 has been idle for 10 seconds, shutting it down.
> Feb 19 10:54:55 ppsnfs101 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1171878885.425241 now 1171878895.425628 dr 1171878890.425647 adv 1171878890.425655:1171878890.425657 func (573d7565:505) 1171878889.164179:1171878889.164185)
> Feb 19 10:54:55 ppsnfs101 kernel: o2net: no longer connected to node ppsnfs102 (num 1) at 192.168.102.32:7777
> Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_do_master_request:1330 ERROR: link to 1 went down!
> Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:914 ERROR: status = -107
> Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_restart_lock_mastery:1214 ERROR: node down! 1
> Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
> Feb 19 11:03:05 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:895 CAD397436504401B86AA79A8BCAE88D4:M00000000000000000a02d1b1f2de9c: at least one node (1) torecover before lock mastery can begin
> Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:847 CAD397436504401B86AA79A8BCAE88D4:$RECOVERY: at least one node (1) torecover before lock mastery can begin
> Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:874 CAD397436504401B86AA79A8BCAE88D4: recovery map is not empty, but must master $RECOVERY lock now
>
>
> I am clueless at this point and have no idea why it fails. If anybody can
> enlighten me, it would be much appreciated.
>
> Kind regards,
> Sebastian
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>

-- 
Sebastian Reitenbach            Tel.: ++49-(0)3381-8904-451
RapidEye AG                     Fax: ++49-(0)3381-8904-101
Molkenmarkt 30                  e-mail:[EMAIL PROTECTED]
D-14776 Brandenburg             web:http://www.rapideye.de
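
For anyone who finds this thread later and wants to rule out a configuration mistake before suspecting the kernel: below is a minimal sketch that parses a cluster.conf laid out like the one quoted above (a `node:` or `cluster:` header followed by `key = value` lines) and checks that it is internally consistent. The function names `parse_cluster_conf` and `check_cluster` are hypothetical helpers written for this illustration and are not part of ocfs2-tools; the checks are only the obvious ones (node_count, unique node numbers, matching cluster name). A clean result says nothing about the kernel problem that was the actual cause here.

```python
from collections import defaultdict

# The configuration quoted in this thread, used as sample input.
SAMPLE = """\
node:
        ip_port = 7777
        ip_address = 192.168.102.31
        number = 0
        name = ppsnfs101
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.102.32
        number = 1
        name = ppsnfs102
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
"""


def parse_cluster_conf(text):
    """Return {stanza_type: [ {key: value, ...}, ... ]} with string values."""
    stanzas = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A stanza header such as "node:" or "cluster:".
            current = {}
            stanzas[line[:-1]].append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
        else:
            raise ValueError(f"unexpected line: {raw!r}")
    return dict(stanzas)


def check_cluster(conf):
    """Return a list of human-readable problems (empty if none were found)."""
    nodes = conf.get("node", [])
    clusters = conf.get("cluster", [])
    if len(clusters) != 1:
        return [f"expected exactly one cluster stanza, found {len(clusters)}"]
    cluster = clusters[0]
    problems = []
    if cluster.get("node_count") != str(len(nodes)):
        problems.append(
            f"node_count is {cluster.get('node_count')}, "
            f"but {len(nodes)} node stanzas are defined"
        )
    numbers = [node.get("number") for node in nodes]
    if len(set(numbers)) != len(numbers):
        problems.append(f"node numbers are not unique: {numbers}")
    for node in nodes:
        if node.get("cluster") != cluster.get("name"):
            problems.append(
                f"node {node.get('name')} refers to cluster "
                f"{node.get('cluster')!r}, expected {cluster.get('name')!r}"
            )
    return problems


if __name__ == "__main__":
    found = check_cluster(parse_cluster_conf(SAMPLE))
    print("\n".join(found) if found else "cluster.conf looks consistent")
```

To check a real file, read it and pass the text to `parse_cluster_conf` instead of `SAMPLE`. The file should be identical on every node, so running the check on each host and comparing the parsed results is a cheap extra step.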
