Hi,

processes in state D look like they are stuck in a kernel call or device request. Do you have a problem with your storage? This is not cluster-related.
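For example, a quick way to see what the D-state processes are blocked on (a minimal sketch, assuming procps and sysstat are installed; the wchan column width is just for readability):

# ps -eo pid,stat,wchan:30,args | awk '$2 ~ /^D/'
# iostat -x 1 5

If wchan points into the block or SCSI layer and iostat -x shows requests that never complete (await climbing, %util pegged), the problem is below LVM rather than in the cluster.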
Kind regards
Fabian

On 08/05/2011 01:55 PM, Ulrich Windl wrote:
> Hi,
>
> we run a cluster that has about 30 LVM VGs that are monitored every minute
> with a timeout of 90s. Surprisingly, even when the system is in a nominal
> state, the LVM monitor times out.
>
> I suspect this has to do with multiple LVM commands being run in parallel,
> like this:
> # ps ax | grep vg
>  2014 pts/0    D+   0:00 vgs
>  2580 ?        D    0:00 vgdisplay -v NFS_C11_IO
>  2638 ?        D    0:00 vgck CBW_DB_BTD
>  2992 ?        D    0:00 vgdisplay -v C11_DB_Exe
>  3002 ?        D    0:00 vgdisplay -v C11_DB_15k
>  4564 pts/2    S+   0:00 grep vg
> # ps ax | grep vg
>  8095 ?        D    0:00 vgck CBW_DB_Exe
>  8119 ?        D    0:00 vgdisplay -v C11_DB_FATA
>  8194 ?        D    0:00 vgdisplay -v NFS_SAP_Exe
>
> When I tried a "vgs" manually, it could not be suspended or killed, and it
> took more than 30 seconds to complete.
>
> Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on
> a machine with lots of disks, RAM and CPUs).
>
> As I had changed all the timeouts via "crm configure edit", I suspect the LRM
> starts all these monitors at the same time, creating massive parallelism.
> Maybe a random start delay would be more useful than having the user specify
> a variable start delay for the monitor. Possibly those stuck monitor
> operations also affect monitors that would otherwise finish in time.
>
> Here's a part of the mess on one node:
> Aug 5 13:50:55 h03 lrmd: [14526]: WARN: operation monitor[360] on
> ocf::LVM::prm_cbw_ci_mnt_lvm for client 14529, its parameters:
> CRM_meta_name=[monitor] crm_feature_set=[3.0.5]
> CRM_meta_record_pending=[true] CRM_meta_timeout=[30000]
> CRM_meta_interval=[10000] volgrpname=[CBW_CI] : pid [29910] timed out
> Aug 5 13:50:55 h03 crmd: [14529]: ERROR: process_lrm_event: LRM operation
> prm_cbw_ci_mnt_lvm_monitor_10000 (360) Timed Out (timeout=30000ms)
> Aug 5 13:50:55 h03 lrmd: [14526]: WARN: perform_ra_op: the operation
> operation monitor[154] on ocf::IPaddr2::prm_a20_ip_1 for client 14529, its
> parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5]
> CRM_meta_record_pending=[true] CRM_meta_timeout=[20000]
> CRM_meta_interval=[10000] iflabel=[a20] ip=[172.20.17.54] stayed in
> operation list for 24020 ms (longer than 10000 ms)
> Aug 5 13:50:56 h03 lrmd: [14526]: WARN: perform_ra_op: the operation
> operation monitor[179] on ocf::Raid1::prm_nfs_cbw_trans_raid1 for client
> 14529, its parameters: CRM_meta_record_pending=[true]
> raidconf=[/etc/mdadm/mdadm.conf] crm_feature_set=[3.0.5] OCF_CHECK_LEVEL=[1]
> raiddev=[/dev/md8] CRM_meta_name=[monitor] CRM_meta_timeout=[60000]
> CRM_meta_interval=[60000] stayed in operation list for 24010 ms (longer than
> 10000 ms)
> Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update
> relayed from h04
> Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_local_callback: Expanded
> fail-count-prm_cbw_ci_mnt_lvm=value++ to 9
> Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_trigger_update: Sending flush
> op to all hosts for: fail-count-prm_cbw_ci_mnt_lvm (9)
> Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_perform_update: Sent update
> 416: fail-count-prm_cbw_ci_mnt_lvm=9
> Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update
> relayed from h04
>
> Regards,
> Ulrich
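On the staggered-start idea in the quoted message: a minimal sketch of spreading the monitors out via "crm configure", assuming the resources use ocf:heartbeat:LVM as the log suggests; the resource and VG names are taken from the log, and the interval/start-delay values are purely illustrative:

primitive prm_cbw_ci_mnt_lvm ocf:heartbeat:LVM \
        params volgrpname="CBW_CI" \
        op monitor interval="63s" timeout="90s" start-delay="7s"

Giving each VG's monitor a slightly different interval (and/or start-delay) lets the operations drift apart instead of all hitting LVM at the same moment.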