Having a very strange issue here.
Quite simple config - two servers, one shared disk device.
Failing over a Filesystem resource.

Using CentOS release 5
heartbeat-stonith-2.1.3-3.el5.centos
heartbeat-gui-2.1.3-3.el5.centos
heartbeat-pils-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-devel-2.1.3-3.el5.centos
linux-2.6.18-53.1.13.el5_lustre.1.6.4.3-fail
(CentOS kernel with Lustre patches)

The filesystem is Lustre, which has been tested quite a bit
with heartbeat V1.

The system is an 8-CPU box with 32GB of physical memory. It's in a test lab and was running nothing other than Heartbeat. I've reproduced the error on two VMware instances running the same versions of everything, so it's not machine-size related.

The config is about as simple as can be:

haresources
srv4 Filesystem::/dev/sdb1::/mnt/mdt::lustre
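For reference, the V2 equivalent works out to roughly this CIB primitive. This is a sketch: the ids and layout here are illustrative, not copied from my actual cib.xml.

```xml
<!-- Hypothetical sketch of the V2 CIB primitive matching the haresources
     line above; ids are made up for illustration. -->
<primitive id="fs_mdt" class="ocf" provider="heartbeat" type="Filesystem">
  <instance_attributes id="fs_mdt_ia">
    <attributes>
      <nvpair id="fs_mdt_device" name="device" value="/dev/sdb1"/>
      <nvpair id="fs_mdt_directory" name="directory" value="/mnt/mdt"/>
      <nvpair id="fs_mdt_fstype" name="fstype" value="lustre"/>
    </attributes>
  </instance_attributes>
</primitive>
```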

ha.cf is generic; heartbeat is bcast on eth0.
With HB V1, this config Just Works. When I run Filesystem by hand,
it Just Works.
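To be concrete, by "run Filesystem by hand" I mean the V1-style positional invocation, roughly as below (paths per the haresources line; environment-specific sketch, shown for clarity only):

```shell
# Hand invocation of the heartbeat V1 resource script. Positional args are
# device, mount point, fstype, then the action. Paths as in haresources.
/etc/ha.d/resource.d/Filesystem /dev/sdb1 /mnt/mdt lustre start
```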

Converting to V2, it fails.

It fails in a very weird way: mount.lustre dies, and the call chain from ldlm_bl_thread_start() leads to the kernel thread start and then to the kernel function do_fork().
------syslog-----------
May 22 16:46:58 d1_q_4 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
May 22 16:46:58 d1_q_4 kernel: LustreError: 3699:0:(ldlm_lockd.c:1779:ldlm_bl_thread_start()) cannot start LDLM thread ldlm_bl_00: rc -513
May 22 16:46:58 d1_q_4 kernel: LustreError: 3699:0:(ldlm_resource.c:302:ldlm_namespace_new()) ldlm_get_ref failed: -513
-------------

'-513' is ERESTARTNOINTR, a kernel-internal code which says fork() was interrupted by a signal.
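(These restart codes come from the kernel's include/linux/errno.h in the 2.6 series and never normally reach userspace; a trivial sketch of the mapping, just to show where -513 lands:)

```shell
# Map the kernel-internal restart codes (include/linux/errno.h, 2.6 kernels)
# to their names; the -513 from the LustreError line above resolves to
# ERESTARTNOINTR.
code=-513
case ${code#-} in
  512) name=ERESTARTSYS ;;
  513) name=ERESTARTNOINTR ;;
  514) name=ERESTARTNOHAND ;;
  *)   name=UNKNOWN ;;
esac
echo "$name"
```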


Filesystem failover for Lustre _should_ be just like Filesystem
failover for an ext3 volume, except that the 'mount' command exec's 'mount.lustre'.
When I strace the Filesystem command I see:
25936 stat("/sbin/mount.lustre", {st_mode=S_IFREG|0755, st_size=25160, ...}) = 0
.......[snip]
25937 open("/sys/block/sdc/queue/max_sectors_kb", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
25937 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
25937 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
25937 write(3, "32767\n", 6)            = 6
25937 close(3)                          = 0
25937 munmap(0x2aaaaaaac000, 4096)      = 0
25937 mount("/dev/sdc", "/mnt/tom", "lustre", 0, "device=/dev/sdc") = -1 ENOMEM (Cannot allocate memory)
25937 --- SIGTERM (Terminated) @ 0 (0) ---

----------------
So, on a system with 32GB of physical memory, running nothing other than
Heartbeat, I get ENOMEM? (This, I would guess, is really just an artifact of the failed fork() call.)

What gives here? What is different in the way V2 and V1 run the Filesystem script? Does V2 place some restriction/security around memory or fork()?

I have tried:
- Converting a working V1 config to V2 with haresources2cib.py
- Starting a V2 config with the bare minimum from Alan's tutorial and adding resources via the GUI
- Starting a V1 min config, converting to V2 with script

To repeat: if I remove 'crm yes' and run V1 style, it Just Works.
If I set up a V1 config, it Just Works.
If I invoke Filesystem by hand, it Just Works.

Any help _very_ much appreciated. I can furnish full strace or whatever else is needed, this is a lab setting.
cliffw



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
