Hi,

Processes in state D look like they are blocked in a kernel call or device request.
Do you have a problem with your storage? This is not cluster-related.
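A quick way to confirm this is to list the D-state processes together with the kernel symbol they are sleeping in (a rough sketch; the exact `wchan` names depend on your kernel):

```shell
# List processes in uninterruptible sleep (state D) along with the kernel
# wait channel (wchan). vgs/vgdisplay stuck here usually points at slow or
# hanging storage I/O rather than at the cluster stack.
ps -eo pid,stat,wchan:30,args | awk 'NR == 1 || $2 ~ /^D/'
```

If the wchan column shows I/O-related symbols for the LVM commands, look at the SAN/multipath layer first.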

Kind regards
Fabian

On 08/05/2011 01:55 PM, Ulrich Windl wrote:
> Hi,
> 
> we run a cluster that has about 30 LVM VGs, each monitored every minute 
> with a timeout of 90s. Surprisingly, even when the system is in a nominal 
> state, the LVM monitor times out.
> 
> I suspect this has to do with multiple LVM commands being run in parallel 
> like this:
> # ps ax |grep vg
>  2014 pts/0    D+     0:00 vgs
>  2580 ?        D      0:00 vgdisplay -v NFS_C11_IO
>  2638 ?        D      0:00 vgck CBW_DB_BTD
>  2992 ?        D      0:00 vgdisplay -v C11_DB_Exe
>  3002 ?        D      0:00 vgdisplay -v C11_DB_15k
>  4564 pts/2    S+     0:00 grep vg
> # ps ax |grep vg
>  8095 ?        D      0:00 vgck CBW_DB_Exe
>  8119 ?        D      0:00 vgdisplay -v C11_DB_FATA
>  8194 ?        D      0:00 vgdisplay -v NFS_SAP_Exe
> 
> When I tried a "vgs" manually, it could not be suspended or killed, and it 
> took more than 30 seconds to complete.
> 
> Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on 
> a machine with lots of disks, RAM and CPUs).
> 
> As I had changed all the timeouts via "crm configure edit", I suspect the LRM 
> starts all these monitors at the same time, creating massive parallelism. 
> Maybe a random start delay would be more useful than having the user specify a 
> variable start delay for the monitor. Possibly those stuck monitor operations 
> also delay monitors that would otherwise finish in time.
> 
> Here's a part of the mess on one node:
> Aug  5 13:50:55 h03 lrmd: [14526]: WARN: operation monitor[360] on 
> ocf::LVM::prm_cbw_ci_mnt_lvm for client 14529, its parameters: 
> CRM_meta_name=[monitor] crm_feature_set=[3.0.5] 
> CRM_meta_record_pending=[true] CRM_meta_timeout=[30000] 
> CRM_meta_interval=[10000] volgrpname=[CBW_CI] : pid [29910] timed out
> Aug  5 13:50:55 h03 crmd: [14529]: ERROR: process_lrm_event: LRM operation 
> prm_cbw_ci_mnt_lvm_monitor_10000 (360) Timed Out (timeout=30000ms)
> Aug  5 13:50:55 h03 lrmd: [14526]: WARN: perform_ra_op: the operation 
> operation monitor[154] on ocf::IPaddr2::prm_a20_ip_1 for client 14529, its 
> parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] 
> CRM_meta_record_pending=[true] CRM_meta_timeout=[20000] 
> CRM_meta_interval=[10000] iflabel=[a20] ip=[172.20.17.54]  stayed in 
> operation list for 24020 ms (longer than 10000 ms)
> Aug  5 13:50:56 h03 lrmd: [14526]: WARN: perform_ra_op: the operation 
> operation monitor[179] on ocf::Raid1::prm_nfs_cbw_trans_raid1 for client 
> 14529, its parameters: CRM_meta_record_pending=[true] 
> raidconf=[/etc/mdadm/mdadm.conf] crm_feature_set=[3.0.5] OCF_CHECK_LEVEL=[1] 
> raiddev=[/dev/md8] CRM_meta_name=[monitor] CRM_meta_timeout=[60000] 
> CRM_meta_interval=[60000]  stayed in operation list for 24010 ms (longer than 
> 10000 ms)
> Aug  5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update 
> relayed from h04
> Aug  5 13:50:56 h03 attrd: [14527]: info: attrd_local_callback: Expanded 
> fail-count-prm_cbw_ci_mnt_lvm=value++ to 9
> Aug  5 13:50:56 h03 attrd: [14527]: info: attrd_trigger_update: Sending flush 
> op to all hosts for: fail-count-prm_cbw_ci_mnt_lvm (9)
> Aug  5 13:50:56 h03 attrd: [14527]: info: attrd_perform_update: Sent update 
> 416: fail-count-prm_cbw_ci_mnt_lvm=9
> Aug  5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update 
> relayed from h04
> 
> Regards,
> Ulrich
> 
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
