Hi Vladislav,

filtering the devices for LVM was also on my mind, but changing the filter 
cluster-wide after any configuration change seems too error-prone to me. On 
the other hand, I couldn't find a generic rule that would match our configuration.
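
For illustration, the kind of device-specific filter that would have to be 
regenerated on every change might look like this (the WWID below is a 
placeholder, not one of our devices):

```
# /etc/lvm/lvm.conf -- accept only the named multipath maps, reject everything else
filter = [ "a|^/dev/mapper/3600a0b8000123456000.*|", "r|.*|" ]
```

Keeping such a list in sync across all nodes on every hotplug event is exactly 
the maintenance burden I'd like to avoid.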

Regarding your attempts to manage priority using "nice": do you know about 
"ionice"? I guess that's closer to what you are looking for.
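
A minimal sketch of how that could be wired into an RA (the wrapper name and 
priority values are my own invention, not from any existing agent):

```shell
# Hypothetical helper: run a command at low I/O priority when ionice
# (from util-linux) is available, otherwise fall back to running it directly.
run_io_nice() {
    if command -v ionice >/dev/null 2>&1; then
        # class 2 = best-effort, level 7 = lowest priority within that class
        ionice -c2 -n7 "$@"
    else
        "$@"
    fi
}

# e.g. in the monitor/stop actions of the RA:
# run_io_nice vgdisplay "$OCF_RESKEY_volgrpname"
```

Unlike nice(1), this lowers the command's standing in the I/O scheduler rather 
than the CPU scheduler, which is what matters when the box is I/O-bound.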

Regards,
Ulrich


>>> Vladislav Bogdanov <[email protected]> schrieb am 15.11.2011 um 08:25 in
Nachricht <[email protected]>:
> Hi,
> 
> 11.11.2011 10:25, Ulrich Windl wrote:
> > Hi!
> > 
> > I found some obscure problem having to do with LVM multipathing and
> > hot-plugged disks:
> > 
> > I have written some RAs that support "hotplugging of SAN disks" via
> > NPIV (N_Port ID Virtualization) and addition and removal of multipath
> > maps. On top of that are LVM and the filesystems.
> > 
> > So far, so good. However, I discovered a problem when multiple
> > resources are shut down in parallel: the LVM tools (like vgdisplay)
> > access all disks that are around, not just the disks that matter.
> > This may lead to a race condition where one resource group stops an
> > LVM monitor, then shuts down the corresponding multipath map, and
> > finally the NPIV device (SCSI unplug). Unfortunately, during that,
> > another LVM command may access the disks that are cleared for removal.
> > 
> > I don't know what exactly happened, but the result was that several
> > vgdisplay commands hung (unkillable even with kill -9), multipath
> > commands hung (device busy through LVM?), and the device could not
> > be removed. It seems some rather global lock is involved that makes
> > more and more commands hang.
> 
> 
> I experienced the same problem, and the solution had three steps:
> 1. Exclude all LVs from being scanned for VGs. I actually edit
> lvm.conf automatically (the "filter" line there), adding each new
> device when it appears in the system. The default policy is r/.*/.
> Otherwise LVM accesses/scans all non-filtered block devices every
> time you run an LVM command. If you have 1000 LVs in some VG, they
> will all be scanned on every request after you activate that VG,
> which slows down subsequent LVM commands dramatically. Every request
> to those LVs consumes some IO, and if you're IO-bound, that can take
> a very long time.
> 2. Raise the scheduling priority of dlm_controld, the dlm kernel
> threads (not sure this has any effect) and clvmd (I run clustered
> LVM), and the priority of LVM commands in the RA (with chrt -r 10).
> 3. Use timeout(1) to run LVM commands from the RA, because yes, LVM
> commands may hang under high IO load. On a timeout I re-try the
> same command.
> 
> After I did that, I performed stress testing: I consumed all
> available IO with dozens of disktest instances, and my cluster
> remained alive and the LVM RA worked as expected. Of course I
> raised the timeouts for the RA operations.
> 
> Also, I use my own RA, which deliberately does not run LVM commands
> in the monitor op, just [ -d /dev/VG ] or [ -e /dev/VG/LV ], which
> is absolutely sufficient.
> 
> Btw, do you use clvm? Unkillable processes sometimes appear when a
> dlm lockspace is stuck. If yes, please look for kern_stop for the
> clvmd lockspace in the output of dlm_tool ls.
> 
> One side note on clvmd: it should be forced to use the corosync
> stack instead of openais. I saw big problems with LCK, which is
> used by default if the openais modules are loaded by corosync, and
> people on the corosync/openais list said that LCK is too
> experimental and not heavily tested.
> 
> Hope this helps,
> Vladislav
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected] 
> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> See also: http://linux-ha.org/ReportingProblems 
> 
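
P.S.: For the archives, the timeout-and-retry pattern from Vladislav's step 3 
could be sketched roughly like this (wrapper name, timeout value, and retry 
count are mine, not taken from his RA):

```shell
# Retry an LVM command up to 3 times, giving each attempt 30 seconds.
# timeout(1) is from GNU coreutils; the values are illustrative only.
lvm_retry() {
    local tries=0
    while [ "$tries" -lt 3 ]; do
        timeout 30 "$@" && return 0
        tries=$((tries + 1))
    done
    return 1
}

# e.g.: lvm_retry chrt -r 10 vgchange -a y "$VG"
```

The point is that a hung LVM command then costs at most one op timeout instead 
of wedging the whole stop sequence.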
