Having a very strange issue here.
Quite simple config - two servers, one shared disk device.
Failing over a Filesystem resource.

Using CentOS release 5
heartbeat-stonith-2.1.3-3.el5.centos
heartbeat-gui-2.1.3-3.el5.centos
heartbeat-pils-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-devel-2.1.3-3.el5.centos
linux-2.6.18-53.1.13.el5_lustre.1.6.4.3-fail
(CentOS kernel with Lustre patches)

The filesystem is Lustre, which has been tested quite a bit
with heartbeat V1.

The system is an 8-CPU box with 32GB of physical memory. It's in a test lab and was running nothing other than Heartbeat. I've reproduced the error on two VMware instances running the same versions of everything, so it's not machine-size related.

The config is about as simple as can be:

haresources
srv4 Filesystem::/dev/sdb1::/mnt/mdt::lustre
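For reference, the V2 equivalent works out to roughly this CIB primitive. This is a sketch: the ids and layout here are illustrative, not copied from my actual cib.xml.

```xml
<!-- Hypothetical sketch of the V2 CIB primitive matching the haresources
     line above; ids are made up for illustration. -->
<primitive id="fs_mdt" class="ocf" provider="heartbeat" type="Filesystem">
  <instance_attributes id="fs_mdt_ia">
    <attributes>
      <nvpair id="fs_mdt_device" name="device" value="/dev/sdb1"/>
      <nvpair id="fs_mdt_directory" name="directory" value="/mnt/mdt"/>
      <nvpair id="fs_mdt_fstype" name="fstype" value="lustre"/>
    </attributes>
  </instance_attributes>
</primitive>
```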

ha.cf is generic; heartbeat is bcast on eth0.
With HB V1, this config Just Works. When I run Filesystem by hand,
it Just Works.
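To be concrete, by "run Filesystem by hand" I mean the V1-style positional invocation, roughly as below (paths per the haresources line; environment-specific sketch, shown for clarity only):

```shell
# Hand invocation of the heartbeat V1 resource script. Positional args are
# device, mount point, fstype, then the action. Paths as in haresources.
/etc/ha.d/resource.d/Filesystem /dev/sdb1 /mnt/mdt lustre start
```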

Converting to V2, it fails.

It fails in a very weird way: mount.lustre dies, and the call chain from ldlm_bl_thread_start() leads to the kernel thread start and then to the kernel function do_fork().
------syslog-----------
May 22 16:46:58 d1_q_4 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
May 22 16:46:58 d1_q_4 kernel: LustreError: 3699:0:(ldlm_lockd.c:1779:ldlm_bl_thread_start()) cannot start LDLM thread ldlm_bl_00: rc -513
May 22 16:46:58 d1_q_4 kernel: LustreError: 3699:0:(ldlm_resource.c:302:ldlm_namespace_new()) ldlm_get_ref failed: -513
-------------

'-513' is ERESTARTNOINTR, a kernel-internal code which says fork() was interrupted by a signal.
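(These restart codes come from the kernel's include/linux/errno.h in the 2.6 series and never normally reach userspace; a trivial sketch of the mapping, just to show where -513 lands:)

```shell
# Map the kernel-internal restart codes (include/linux/errno.h, 2.6 kernels)
# to their names; the -513 from the LustreError line above resolves to
# ERESTARTNOINTR.
code=-513
case ${code#-} in
  512) name=ERESTARTSYS ;;
  513) name=ERESTARTNOINTR ;;
  514) name=ERESTARTNOHAND ;;
  *)   name=UNKNOWN ;;
esac
echo "$name"
```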


Filesystem failover for Lustre _should_ be just like Filesystem
failover for an ext3 volume, except that the 'mount' command exec's 'mount.lustre'.
When I strace the Filesystem command I see:
25936 stat("/sbin/mount.lustre", {st_mode=S_IFREG|0755, st_size=25160, ...}) = 0
.......[snip]
25937 open("/sys/block/sdc/queue/max_sectors_kb", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
25937 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
25937 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
25937 write(3, "32767\n", 6)            = 6
25937 close(3)                          = 0
25937 munmap(0x2aaaaaaac000, 4096)      = 0
25937 mount("/dev/sdc", "/mnt/tom", "lustre", 0, "device=/dev/sdc") = -1 ENOMEM (Cannot allocate memory)
25937 --- SIGTERM (Terminated) @ 0 (0) ---

----------------
So, on a system with 32GB of physical memory, running nothing other than
Heartbeat, I get ENOMEM? (This, I would guess, is really just an artifact of the failed fork() call.)

What gives here? What is different in the way V2 and V1 run the Filesystem script? Does V2 place some restriction/security around memory or fork()?

I have tried:
- Converting a working V1 config to V2 with haresources2cib.py
- Starting a V2 config with the bare minimum from Alan's tutorial and adding resources via the GUI
- Starting a V1 min config, converting to V2 with script

To repeat: if I remove 'crm yes' and run V1 style, it Just Works.
If I set up a V1 config, it Just Works.
If I invoke Filesystem by hand, it Just Works.

Any help _very_ much appreciated. I can furnish full strace or whatever else is needed, this is a lab setting.
cliffw



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
