Having a very strange issue here.
Quite simple config - two servers, one shared disk device.
Failing over a Filesystem resource.
Using CentOS release 5
heartbeat-stonith-2.1.3-3.el5.centos
heartbeat-gui-2.1.3-3.el5.centos
heartbeat-pils-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-devel-2.1.3-3.el5.centos
linux-2.6.18-53.1.13.el5_lustre.1.6.4.3-fail
(Centos kernel with Lustre patches)
The filesystem is Lustre, which has been tested quite a bit
with heartbeat V1.
The system is an 8-CPU box with 32GB of physical memory. It's in a test
lab and was running nothing other than Heartbeat.
I've reproduced the error on two VMware instances running the same
versions of everything, so it's not machine-size related.
The config is about as simple as can be:
haresources
srv4 Filesystem::/dev/sdb1::/mnt/mdt::lustre
ha.cf is generic, eth0 is bcast heartbeat.
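To be concrete about "generic": the ha.cf is essentially the stock sample.
A sketch of what it looks like (node names and timing values here are
illustrative, not copied from the real file):

```
# /etc/ha.d/ha.cf -- illustrative sketch only
logfacility local0
keepalive 2
deadtime 30
bcast eth0          # heartbeat over eth0 broadcast, as noted above
node srv3 srv4      # srv4 matches the haresources line; srv3 is assumed
crm yes             # the V2 switch; removing this line reverts to V1 behavior
```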
With HB V1, this config Just Works. When I run Filesystem by hand,
it Just Works.
Converting to V2, it fails, and fails in a very weird way. The
mount.lustre process dies; the call chain from ldlm_bl_thread_start()
leads to the kernel thread start and then to the kernel function do_fork()
------syslog-----------
May 22 16:46:58 d1_q_4 kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 22 16:46:58 d1_q_4 kernel: LustreError:
3699:0:(ldlm_lockd.c:1779:ldlm_bl_thread_start()) cannot start LDLM
thread ldlm_bl_00: rc -513
May 22 16:46:58 d1_q_4 kernel: LustreError:
3699:0:(ldlm_resource.c:302:ldlm_namespace_new()) ldlm_get_ref failed: -513
-------------
'-513' is ERESTARTNOINTR which says fork() was interrupted by a signal.
A filesystem failover for Lustre _should_ be just like a Filesystem
failover for an ext3 volume, except that the 'mount' command execs
'mount.lustre'.
When I strace the Filesystem command I see:
25936 stat("/sbin/mount.lustre", {st_mode=S_IFREG|0755, st_size=25160,
...}) = 0
.......[snip]
25937 open("/sys/block/sdc/queue/max_sectors_kb",
O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
25937 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
25937 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x2aaaaaaac000
25937 write(3, "32767\n", 6) = 6
25937 close(3) = 0
25937 munmap(0x2aaaaaaac000, 4096) = 0
25937 mount("/dev/sdc", "/mnt/tom", "lustre", 0, "device=/dev/sdc") = -1
ENOMEM (Cannot allocate memory)
25937 --- SIGTERM (Terminated) @ 0 (0) ---
----------------
So, on a system with 32GB of physical memory, running nothing other than
Heartbeat, I get ENOMEM? (This, I would guess, is really just an artifact
of the failed fork() call.)
What gives here? What is different in the way V2 and V1 run the
Filesystem script? Does V2 place some restriction/security around memory
or fork()?
I have tried:
- Converting a working V1 config to V2 with haresources2cib.py
- Starting a V2 config with the bare min from Alan's tutorial and adding
resources via GUI
- Starting a V1 min config, converting to V2 with script
To repeat: if I remove 'crm yes' and run V1 style, it Just Works.
If I set up a V1 config, it Just Works.
If I invoke Filesystem by hand, it Just Works.
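By "by hand" I mean invoking the agent roughly the way the cluster would.
A sketch of both styles, with paths assumed from a stock 2.1.3 install
(the V1 wrapper lives in /etc/ha.d/resource.d, the OCF agent in
/usr/lib/ocf/resource.d/heartbeat):

```shell
# V1 (haresources) style -- positional arguments, as heartbeat V1 calls it:
/etc/ha.d/resource.d/Filesystem /dev/sdb1 /mnt/mdt lustre start

# V2 (OCF) style -- parameters passed via OCF_RESKEY_* environment
# variables, roughly what lrmd does when the CRM starts the resource:
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_device=/dev/sdb1
export OCF_RESKEY_directory=/mnt/mdt
export OCF_RESKEY_fstype=lustre
/usr/lib/ocf/resource.d/heartbeat/Filesystem start
```

Both forms succeed when run from a root shell; only the CRM-driven start fails.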
Any help _very_ much appreciated. I can furnish full strace or whatever
else is needed, this is a lab setting.
cliffw
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems