Hi folks,
I've just upgraded a 2.7.0 cluster to 2.10.3 and thought I'd take advantage of the new HA resource agents. Sadly, I find that the resource agent successfully mounts the OSDs, but the resource then stops (leaving the OSDs mounted). Here's an example case, the management OSD, created with the following:

# pcs resource create MGT ocf:lustre:Lustre target=/dev/disk/by-label/MGS mountpoint=/mnt/MGT; pcs constraint location MGT prefers hpctestmds1=100

This results in the following, leaving the resource stopped but the MGT mounted:

Mar 07 13:28:22 hpctestmds1.our.domain Lustre(MGT)[32115]: ERROR: /dev/disk/by-label/MGS is not mounted
Mar 07 13:28:22 hpctestmds1.our.domain crmd[11459]: notice: Result of probe operation for MGT on hpctestmds1: 7 (not running)
Mar 07 13:28:22 hpctestmds1.our.domain Lustre(MGT)[32128]: INFO: Starting to mount /dev/disk/by-label/MGS
Mar 07 13:28:22 hpctestmds1.our.domain kernel: LDISKFS-fs (sde): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Mar 07 13:28:22 hpctestmds1.our.domain kernel: Lustre: MGS: Connection restored to 9eb39832-a281-1088-d816-410b918b5813 (at 0@lo)
Mar 07 13:28:22 hpctestmds1.our.domain kernel: Lustre: Skipped 6 previous similar messages
Mar 07 13:28:22 hpctestmds1.our.domain Lustre(MGT)[32173]: INFO: /dev/disk/by-label/MGS mounted successfully
Mar 07 13:28:22 hpctestmds1.our.domain crmd[11459]: notice: Result of start operation for MGT on hpctestmds1: 0 (ok)
Mar 07 13:28:22 hpctestmds1.our.domain Lustre(MGT)[32189]: ERROR: /dev/disk/by-label/MGS is not mounted
Mar 07 13:28:22 hpctestmds1.our.domain crmd[11459]: notice: Result of stop operation for MGT on hpctestmds1: 0 (ok)
Mar 07 13:28:23 hpctestmds1.our.domain Lustre(MGT)[32207]: INFO: Starting to mount /dev/disk/by-label/MGS
Mar 07 13:28:23 hpctestmds1.our.domain Lustre(MGT)[32215]: ERROR: mount failed
Mar 07 13:28:23 hpctestmds1.our.domain Lustre(MGT)[32221]: ERROR: /dev/disk/by-label/MGS can not be mounted with this error: 1
Mar 07 13:28:23 hpctestmds1.our.domain lrmd[11456]: notice: MGT_start_0:32200:stderr [ mount.lustre: according to /etc/mtab /dev/sde is already mounted on /mnt/MGT ]
Mar 07 13:28:23 hpctestmds1.our.domain crmd[11459]: notice: Result of start operation for MGT on hpctestmds1: 1 (unknown error)
Mar 07 13:28:23 hpctestmds1.our.domain crmd[11459]: notice: hpctestmds1-MGT_start_0:558 [ mount.lustre: according to /etc/mtab /dev/sde is already mounted on /mnt/MGT\n ]
Mar 07 13:28:23 hpctestmds1.our.domain crmd[11459]: notice: Result of stop operation for MGT on hpctestmds1: 0 (ok)
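For what it's worth, my (unverified) hunch is that the agent's mount check looks for the literal target= path in the mount table, while the kernel records the resolved device (/dev/sde in the log above), so the monitor decides the MGT is not mounted and Pacemaker promptly stops the resource. A quick way to see the mismatch I mean, run on the node while the MGT is still mounted (paths as in my config):

# grep ' /mnt/MGT ' /proc/mounts
  (the device column shows /dev/sde, not the by-label symlink)
# readlink -f /dev/disk/by-label/MGS
  (resolves to /dev/sde)
# grep '/dev/disk/by-label/MGS' /proc/mounts || echo "by-label path not in /proc/mounts"
  (no match for the literal path the resource was configured with)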
I then delete the resource, unmount the MGT, and make a new resource with the old ocf:heartbeat:Filesystem agent, setting the options to match the defaults from the ocf:lustre:Lustre agent, as follows:

# pcs resource create MGT Filesystem device=/dev/disk/by-label/MGS directory=/mnt/MGT fstype="lustre" meta op monitor interval="20" timeout="300" op start interval="0" timeout="300" op stop interval="0" timeout="300"; pcs constraint location MGT prefers hpctestmds1=100

This results in a happier resource start: the Pacemaker resource stays "Started" and the mount persists. From journalctl:

Mar 07 13:35:07 hpctestmds1.our.domain crmd[11459]: notice: Result of probe operation for MGT on hpctestmds1: 7 (not running)
Mar 07 13:35:07 hpctestmds1.our.domain Filesystem(MGT)[744]: INFO: Running start for /dev/disk/by-label/MGS on /mnt/MGT
Mar 07 13:35:07 hpctestmds1.our.domain kernel: LDISKFS-fs (sde): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Mar 07 13:35:07 hpctestmds1.our.domain kernel: Lustre: MGS: Connection restored to 9eb39832-a281-1088-d816-410b918b5813 (at 0@lo)
Mar 07 13:35:07 hpctestmds1.our.domain crmd[11459]: notice: Result of start operation for MGT on hpctestmds1: 0 (ok)

Has anyone experienced similar results? Any tips?

Cheers,
CanWood
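P.S. In case anyone wants to reproduce the two attempts back to back, the cleanup in between was roughly the obvious pair of commands (resource name and mount point as above):

# pcs resource delete MGT
# umount /mnt/MGT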