I have 2 MDS nodes configured as an active/standby pair, 5 OSTs that are NOT active/standby, and 5 clients. I am running Lustre 1.6.5 because of bug 18232 <https://bugzilla.lustre.org/show_bug.cgi?id=18232>, which only affects 1.6.6. With Lustre 1.6.5, when I reset my active node, the standby takes over; this is quite reliable.

Today I did the following, in this order:

1. Unmounted all the clients
2. Rebooted all the clients
3. Stopped Linux HA from running
4. Unmounted the OSTs
5. Unmounted the MDS
6. Rebooted the OSTs
7. Rebooted both MDSes

When the MDSes started up, Linux HA chose one to be active, and that system mounted the MDT. I looked at the file /proc/fs/lustre/mds/tacc-MDT0000/recovery_status, and it showed:

[r...@ts-tacc-01 ~]# cat /proc/fs/lustre/mds/tacc-MDT0000/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/5
completed_clients: 0/5
replayed_requests: 0/??
queued_requests: 0
next_transno: 17768

***** Note that recovery_start and time_remaining are both zero. *****

I waited several minutes, and the file did not change. I was waiting for recovery to complete before trying to mount the OSTs, but it appears that this would never happen. Does this look like a bug? (A loop for watching this file is sketched at the end of this message.)

---------------------------

I format my MDT using the following command. The command is run from 10.2.43.1, and the failnode is 10.2.43.2:

mkfs.lustre --reformat --fsname tacc --mdt --mgs --device-size=10000000 \
    --mkfsoptions=' -m 0 -O mmp' --failnode=10.2.4...@o2ib0 /dev/sdb

I format the OSTs using the following command:

/usr/bin/time -p mkfs.lustre --reformat --ost --mkfsoptions='-J device=/dev/sdc1 -m 0' \
    --fsname tacc --device-size=400000000 --mgsnode=10.2.4...@o2ib0 \
    --mgsnode=10.2.4...@o2ib0 /dev/sdb

I mount the clients using:

mount -t lustre 10.2.4...@o2ib:10.2.4...@o2ib:/tacc /mnt/lustre
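For anyone hitting the same thing, here is the kind of loop I would use to watch the recovery file instead of re-running cat by hand. This is a minimal sketch: I am assuming the status line changes from RECOVERING to COMPLETE once recovery finishes, and the 10-minute cutoff is an arbitrary choice.

#!/bin/sh
# Poll the MDT recovery state until it leaves RECOVERING or we give up.
f=/proc/fs/lustre/mds/tacc-MDT0000/recovery_status
timeout=600                 # seconds; arbitrary cutoff
waited=0
while [ "$waited" -lt "$timeout" ]; do
    state=$(awk '/^status:/ {print $2}' "$f")
    [ "$state" != "RECOVERING" ] && break
    sleep 10
    waited=$((waited + 10))
done
echo "status=$state after ${waited}s"
cat "$f"

Given the output above, this loop would run to the full timeout with status=RECOVERING, which is what makes me think recovery would never complete.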
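Separately, it may be worth confirming that the failover and MGS NIDs were actually recorded on the targets the way the mkfs.lustre commands above intended. A sketch of how to check (as far as I know, tunefs.lustre --print is a read-only dry run that just reports the parameters stored on the device; the exact parameter names may vary slightly between versions):

# On the MDS node: the MDT label should show the failnode NID.
tunefs.lustre --print /dev/sdb | grep -i failover

# On each OSS: the OST label should show both --mgsnode NIDs.
tunefs.lustre --print /dev/sdb | grep -i mgsnode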
