Hi,

We run a cluster with about 30 LVM VGs that are monitored every minute with a
timeout of 90s. Surprisingly, even when the system is in a nominal state, the
LVM monitor times out.

I suspect this is caused by multiple LVM commands running in parallel, like
this:
# ps ax |grep vg
 2014 pts/0    D+     0:00 vgs
 2580 ?        D      0:00 vgdisplay -v NFS_C11_IO
 2638 ?        D      0:00 vgck CBW_DB_BTD
 2992 ?        D      0:00 vgdisplay -v C11_DB_Exe
 3002 ?        D      0:00 vgdisplay -v C11_DB_15k
 4564 pts/2    S+     0:00 grep vg
# ps ax |grep vg
 8095 ?        D      0:00 vgck CBW_DB_Exe
 8119 ?        D      0:00 vgdisplay -v C11_DB_FATA
 8194 ?        D      0:00 vgdisplay -v NFS_SAP_Exe

When I tried a "vgs" manually, it could not be suspended or killed, and it took 
more than 30 seconds to complete.

Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on a 
machine with lots of disks, RAM and CPUs).

As I had changed all the timeouts at once via "crm configure edit", I suspect
the LRM starts all these monitors at the same time, creating massive
parallelism. Maybe a random start delay would be more useful than having the
user specify a variable start delay for each monitor. Possibly those stuck
monitor operations also affect monitors that would otherwise finish in time.
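
To illustrate what I mean by spreading the monitors out, here is a minimal
Python sketch. It is not something the LRM does today, as far as I know; the
function names are made up, and the numbers (30 VGs, 60s interval) are just the
rough figures from our setup. It only computes candidate delays, either evenly
spaced or random, and does not touch the cluster configuration.

```python
import random


def staggered_delays(n_resources, interval_s):
    """Spread n_resources start delays evenly across one monitor interval.

    Hypothetical helper: returns one delay (in seconds) per resource.
    """
    if n_resources <= 0:
        return []
    step = interval_s / n_resources
    return [round(i * step, 1) for i in range(n_resources)]


def random_delays(n_resources, interval_s, seed=None):
    """Pick a random start delay within the interval for each resource.

    Hypothetical helper: 'seed' only makes the result reproducible.
    """
    rng = random.Random(seed)
    return [round(rng.uniform(0, interval_s), 1) for _ in range(n_resources)]


if __name__ == "__main__":
    # Assumed figures: about 30 VGs, monitored every 60 seconds.
    for index, delay in enumerate(staggered_delays(30, 60)):
        print(f"VG #{index:02d}: start monitor after {delay}s")
```

With even spacing, only one LVM monitor would start every two seconds instead
of all 30 starting at once; the random variant gives a similar spread without
the user having to assign each delay by hand.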

Here's a part of the mess on one node:
Aug  5 13:50:55 h03 lrmd: [14526]: WARN: operation monitor[360] on 
ocf::LVM::prm_cbw_ci_mnt_lvm for client 14529, its parameters: 
CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] 
CRM_meta_timeout=[30000] CRM_meta_interval=[10000] volgrpname=[CBW_CI] : pid 
[29910] timed out
Aug  5 13:50:55 h03 crmd: [14529]: ERROR: process_lrm_event: LRM operation 
prm_cbw_ci_mnt_lvm_monitor_10000 (360) Timed Out (timeout=30000ms)
Aug  5 13:50:55 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation 
monitor[154] on ocf::IPaddr2::prm_a20_ip_1 for client 14529, its parameters: 
CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] 
CRM_meta_timeout=[20000] CRM_meta_interval=[10000] iflabel=[a20] 
ip=[172.20.17.54]  stayed in operation list for 24020 ms (longer than 10000 ms)
Aug  5 13:50:56 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation 
monitor[179] on ocf::Raid1::prm_nfs_cbw_trans_raid1 for client 14529, its 
parameters: CRM_meta_record_pending=[true] raidconf=[/etc/mdadm/mdadm.conf] 
crm_feature_set=[3.0.5] OCF_CHECK_LEVEL=[1] raiddev=[/dev/md8] 
CRM_meta_name=[monitor] CRM_meta_timeout=[60000] CRM_meta_interval=[60000]  
stayed in operation list for 24010 ms (longer than 10000 ms)
Aug  5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed 
from h04
Aug  5 13:50:56 h03 attrd: [14527]: info: attrd_local_callback: Expanded 
fail-count-prm_cbw_ci_mnt_lvm=value++ to 9
Aug  5 13:50:56 h03 attrd: [14527]: info: attrd_trigger_update: Sending flush 
op to all hosts for: fail-count-prm_cbw_ci_mnt_lvm (9)
Aug  5 13:50:56 h03 attrd: [14527]: info: attrd_perform_update: Sent update 
416: fail-count-prm_cbw_ci_mnt_lvm=9
Aug  5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed 
from h04

Regards,
Ulrich



