On Fri, May 23, 2008 at 9:42 PM, Cliff White <[EMAIL PROTECTED]> wrote:
> Having a very strange issue here.
> Quite simple config - two servers, one shared disk device.
> Failing over a Filesystem resource.
>
> Using Centos release 5
> heartbeat-stonith-2.1.3-3.el5.centos
> heartbeat-gui-2.1.3-3.el5.centos
> heartbeat-pils-2.1.3-3.el5.centos
> heartbeat-2.1.3-3.el5.centos
> heartbeat-devel-2.1.3-3.el5.centos
> linux-2.6.18-53.1.13.el5_lustre.1.6.4.3-fail
> (Centos kernel with Lustre patches)
>
> The filesystem is Lustre, which has been tested quite a bit
> with heartbeat V1.
>
> The system is an 8-CPU box with 32GB of physical memory. It's in a test
> lab and was running nothing other than Heartbeat.
> I've reproduced the error on two VMware instances running the same versions
> of everything, so it's not machine-size related.
>
> The config is about as simple as can be:
>
> haresources
> srv4 Filesystem::/dev/sdb1::/mnt/mdt::lustre
>
> ha.cf is generic, eth0 is bcast heartbeat.
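
For comparison, in V2/CRM mode that single haresources line becomes a CIB
primitive. A rough sketch of the equivalent resource (the id values below are
placeholders I made up, not taken from the poster's config):

```xml
<primitive id="fs_mdt" class="ocf" provider="heartbeat" type="Filesystem">
  <instance_attributes id="fs_mdt_ia">
    <attributes>
      <nvpair id="fs_mdt_dev" name="device" value="/dev/sdb1"/>
      <nvpair id="fs_mdt_dir" name="directory" value="/mnt/mdt"/>
      <nvpair id="fs_mdt_fstype" name="fstype" value="lustre"/>
    </attributes>
  </instance_attributes>
</primitive>
```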
> With HB V1, this config Just Works. When I run Filesystem by hand,
> it Just Works.
>
> Converting to V2, it fails.
>
> It fails in a very weird way. The mount.lustre process dies; the call chain
> from ldlm_bl_thread_start() leads to the kernel thread start and then to
> the kernel function do_fork().
> ------syslog-----------
> May 22 16:46:58 d1_q_4 kernel: LDISKFS-fs: mounted filesystem with ordered
> data mode.
> May 22 16:46:58 d1_q_4 kernel: LustreError:
> 3699:0:(ldlm_lockd.c:1779:ldlm_bl_thread_start()) cannot start LDLM thread
> ldlm_bl_00: rc -513
> May 22 16:46:58 d1_q_4 kernel: LustreError:
> 3699:0:(ldlm_resource.c:302:ldlm_namespace_new()) ldlm_get_ref failed: -513
> -------------
>
> '-513' is -ERESTARTNOINTR, which says fork() was interrupted by a signal.
>
>
> The filesystem failover for Lustre _should_ be just like a Filesystem
> failover for an ext3 volume, except the 'mount' command exec's
> 'mount.lustre'
> When I strace the Filesystem command I see:
> 25936 stat("/sbin/mount.lustre", {st_mode=S_IFREG|0755, st_size=25160, ...})
> = 0
> .......[snip]
> 25937 open("/sys/block/sdc/queue/max_sectors_kb", O_WRONLY|O_CREAT|O_TRUNC,
> 0666) = 3
> 25937 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> 25937 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x2aaaaaaac000
> 25937 write(3, "32767\n", 6)            = 6
> 25937 close(3)                          = 0
> 25937 munmap(0x2aaaaaaac000, 4096)      = 0
> 25937 mount("/dev/sdc", "/mnt/tom", "lustre", 0, "device=/dev/sdc") = -1
> ENOMEM (Cannot allocate memory)
> 25937 --- SIGTERM (Terminated) @ 0 (0) ---
>
> ----------------
> So, on a system with 32GB of physical memory, running nothing other than
> Heartbeat, I get ENOMEM? (This, I would guess, is really just an artifact
> of the bad fork() call.)
>
> What gives here? What is different in the way V2 and V1 run the Filesystem
> script? Does V2 place some restriction/security around memory or fork()?
>
> I have tried:
> - Converting a working V1 config to V2 with haresources2cib.py
> - Starting a V2 config with the bare min from Alan's tutorial and adding
> resources via GUI
> - Starting a V1 min config, converting to V2 with script
>
> To repeat, if I remove 'crm yes' and run V1 style, it Just Works.
> If I set up a V1 config, it Just Works.
> If I invoke Filesystem by hand, it Just Works.
>
> Any help _very_ much appreciated. I can furnish a full strace or whatever
> else is needed; this is a lab setting.

The V1 script parses a few parameters and ends up calling the V2 script.
So it's probably worth using "env" to log the script's parameters
and/or "set -x" to trace what the script is doing differently.
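
For example, a throwaway wrapper script can capture the environment and
arguments the cluster manager passes to the agent before delegating to the
real one. This is only a sketch; REAL_RA and RA_TRACE_LOG are made-up names
for illustration, and the dry run below delegates to /bin/true instead of the
actual Filesystem agent:

```shell
# Create a tracing wrapper: log date, arguments, and the full (sorted)
# environment to a file, then exec the real resource agent.
cat > /tmp/ra-wrap.sh <<'EOF'
#!/bin/sh
{ echo "=== $(date) args: $*"; env | sort; } >> "${RA_TRACE_LOG:-/tmp/ra-trace.log}"
exec "${REAL_RA:-/bin/true}" "$@"
EOF
chmod +x /tmp/ra-wrap.sh

# Dry run: delegate to /bin/true rather than the real agent.
RA_TRACE_LOG=/tmp/ra-trace.log REAL_RA=/bin/true /tmp/ra-wrap.sh start
```

Pointing REAL_RA at the actual agent (the V1 script under
/etc/ha.d/resource.d/ or the OCF one under
/usr/lib/ocf/resource.d/heartbeat/) and temporarily swapping the wrapper in
lets you diff what V1 and V2 hand the script.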

Also, using hb_report to grab all the logs and configuration would
probably help.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
