Hi,

we run a cluster with about 30 LVM VGs that are monitored every minute with a timeout of 90s. Surprisingly, even when the system is in a nominal state, the LVM monitor times out.
I suspect this has to do with multiple LVM commands being run in parallel, like this:

# ps ax |grep vg
 2014 pts/0 D+ 0:00 vgs
 2580 ? D 0:00 vgdisplay -v NFS_C11_IO
 2638 ? D 0:00 vgck CBW_DB_BTD
 2992 ? D 0:00 vgdisplay -v C11_DB_Exe
 3002 ? D 0:00 vgdisplay -v C11_DB_15k
 4564 pts/2 S+ 0:00 grep vg
# ps ax |grep vg
 8095 ? D 0:00 vgck CBW_DB_Exe
 8119 ? D 0:00 vgdisplay -v C11_DB_FATA
 8194 ? D 0:00 vgdisplay -v NFS_SAP_Exe

When I ran "vgs" manually, it could not be suspended or killed, and it took more than 30 seconds to complete. Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on a machine with lots of disks, RAM and CPUs).

As I had changed all the timeouts via "crm configure edit", I suspect the LRM starts all these monitors at the same time, creating massive parallelism. Maybe a random start delay would be more useful than having the user specify a variable start delay for each monitor. Possibly those stuck monitor operations also affect monitors that would otherwise finish in time.
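To illustrate what I mean by a random start delay, here is a minimal sketch in Python. It is not how the LRM schedules monitors; the function name random_start_delays and its seed parameter are made up for illustration, and the 60 second window is just the monitor interval mentioned above. The idea is only that each VG monitor gets its own offset inside the interval, so the vgdisplay/vgck calls do not all start at the same moment.

```python
import random


def random_start_delays(resources, interval_s, seed=None):
    """Pick a random whole-second start delay in [0, interval_s) per resource.

    Hypothetical helper: it only computes offsets, it does not talk to
    the cluster stack.
    """
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    return {name: rng.randrange(interval_s) for name in resources}


if __name__ == "__main__":
    # VG names taken from the ps output above
    vgs = ["NFS_C11_IO", "CBW_DB_BTD", "C11_DB_Exe", "C11_DB_15k"]
    for name, delay in sorted(random_start_delays(vgs, 60).items()):
        print(f"{name}: start monitor after {delay}s")
```

The resulting offsets could presumably be fed into the per-monitor start delay by hand, but having the LRM do something like this itself is what I am suggesting.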
Here's a part of the mess on one node:

Aug 5 13:50:55 h03 lrmd: [14526]: WARN: operation monitor[360] on ocf::LVM::prm_cbw_ci_mnt_lvm for client 14529, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] CRM_meta_timeout=[30000] CRM_meta_interval=[10000] volgrpname=[CBW_CI] : pid [29910] timed out
Aug 5 13:50:55 h03 crmd: [14529]: ERROR: process_lrm_event: LRM operation prm_cbw_ci_mnt_lvm_monitor_10000 (360) Timed Out (timeout=30000ms)
Aug 5 13:50:55 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation monitor[154] on ocf::IPaddr2::prm_a20_ip_1 for client 14529, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] iflabel=[a20] ip=[172.20.17.54] stayed in operation list for 24020 ms (longer than 10000 ms)
Aug 5 13:50:56 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation monitor[179] on ocf::Raid1::prm_nfs_cbw_trans_raid1 for client 14529, its parameters: CRM_meta_record_pending=[true] raidconf=[/etc/mdadm/mdadm.conf] crm_feature_set=[3.0.5] OCF_CHECK_LEVEL=[1] raiddev=[/dev/md8] CRM_meta_name=[monitor] CRM_meta_timeout=[60000] CRM_meta_interval=[60000] stayed in operation list for 24010 ms (longer than 10000 ms)
Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed from h04
Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_local_callback: Expanded fail-count-prm_cbw_ci_mnt_lvm=value++ to 9
Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prm_cbw_ci_mnt_lvm (9)
Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_perform_update: Sent update 416: fail-count-prm_cbw_ci_mnt_lvm=9
Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed from h04

Regards,
Ulrich
