On Jul 7, 2014, at 16:53, Nir Soffer <nsof...@redhat.com> wrote:

> ----- Original Message -----
>> From: "Francesco Romani" <from...@redhat.com>
>> To: devel@ovirt.org
>> Cc: "Nir Soffer" <nsof...@redhat.com>
>> Sent: Monday, July 7, 2014 5:04:02 PM
>> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
>> of stuck calls
>>
>> ----- Original Message -----
>>> From: "Nir Soffer" <nsof...@redhat.com>
>>> To: "Francesco Romani" <from...@redhat.com>
>>> Cc: devel@ovirt.org
>>> Sent: Friday, July 4, 2014 10:30:24 PM
>>> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
>>> of stuck calls
>>
>>>> Quick summary of the discussion on gerrit so far:
>>>> 1. extract the scheduling logic from the thread pool. Either add a
>>>> separate scheduler class, or let the sampling tasks reschedule
>>>> themselves after a successful completion. Either way, the concept of
>>>> 'periodic task', and the added complexity, isn't needed.
>>>
>>> This will also allow a task to change the sampling policy depending on
>>> the results. If some call always fails, maybe we can run it less often.
>>
>> This looks like a feature worth having.
>
> But let's take a note and forget about this - we should focus on a simple
> solution.
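Agreed. Just to make item 1 concrete, this is roughly the shape I have in
mind for self-rescheduling tasks - an illustrative sketch only, the class
and method names are made up and not actual VDSM code:

    import logging
    import threading


    class Scheduler(object):
        """Only knows how to run a callable after a delay; periodicity
        comes from the tasks rescheduling themselves."""

        def schedule(self, delay, func):
            timer = threading.Timer(delay, func)
            timer.daemon = True
            timer.start()
            return timer


    class SamplingTask(object):
        def __init__(self, scheduler, pool, period, sample_func):
            self._scheduler = scheduler
            self._pool = pool  # anything with submit(callable)
            self._period = period
            self._sample = sample_func

        def start(self):
            self._scheduler.schedule(self._period, self._dispatch)

        def _dispatch(self):
            # keep the timer callback short: hand the work to the pool
            self._pool.submit(self._run)

        def _run(self):
            try:
                self._sample()
            except Exception:
                logging.exception("sampling failed")
            # reschedule only after the call returned, so a stuck call
            # stops further sampling of this kind instead of piling up;
            # a policy hook here could also stretch the period after
            # repeated failures
            self.start()

No 'periodic task' concept anywhere: the scheduler stays trivial and the
policy lives entirely in the task.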
>>>> 4. the sampling task (or maybe the scheduler) can be smarter and halt
>>>> the sampling in the presence of non-responding calls for a given VM,
>>>> granted the VM reports its 'health'/responsiveness.
>>>
>>> The scheduler should not do this. The object that scheduled the tasks
>>> should cancel them if the vm is not responding.
>>
>> Yes, then we need a separate supervisor thread - or otherwise I don't
>> yet see how we can achieve this.
>>
>>> The current model gets this right because all calls related to a
>>> specific vm run on the same vm thread, so a stuck call will prevent all
>>> other calls from running.
>>>
>>> It seems that this is an important requirement - if we have this issue,
>>> all calls related to a specific vm may start to block forever. Did we
>>> know that this happens?
>>
>> Yes. In all the reported cases it was directly or indirectly storage
>> being unavailable.
>> Directly: libvirt does some access (stat(), for example) and gets stuck.
>> NFS soft mounts don't always solve this.
>> Indirectly: qemu stuck in D state because it was doing I/O and the
>> storage disappeared under its feet.
>
> When this happens - do all calls related to a specific vm block, or is it
> always the dangerous calls you listed below?

The storage stats are "dangerous", but ultimately anything can happen...

>>> If one vm may stop responding, causing all libvirt calls for this vm to
>>> block, then a thread pool with one connection per worker thread can
>>> lead to a failure when all connections happen to run a request that
>>> blocks the thread. If this is the case, then each task related to one
>>> vm must depend on the other tasks, and should be skipped until the
>>> previous task has returned, simulating the current threading model
>>> without creating 100's of threads.
>>
>> Agreed, we should introduce this concept; it is lacking in my threadpool
>> proposal.
>
> So basically the current threading model is the behavior we want?
>
> If some call gets stuck, stop sampling this vm. Continue when the
> call returns.

For stats - yes. A sketch of how this could look follows below.

> Michal? Federico?
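Something along these lines - again just an illustrative sketch, the names
are made up:

    import threading


    class PerVmGate(object):
        """Allows at most one in-flight sampling call per vm, mimicking
        the one-thread-per-vm serialization without per-vm threads."""

        def __init__(self):
            self._lock = threading.Lock()
            self._busy = set()

        def try_acquire(self, vm_id):
            with self._lock:
                if vm_id in self._busy:
                    return False  # previous call still running (or stuck)
                self._busy.add(vm_id)
                return True

        def release(self, vm_id):
            with self._lock:
                self._busy.discard(vm_id)


    def sample_vm(gate, vm_id, sample_func):
        if not gate.try_acquire(vm_id):
            return  # skip: don't pile more calls onto a stuck vm
        try:
            sample_func(vm_id)
        finally:
            gate.release(vm_id)

A stuck vm then costs at most one pool thread, and its sampling resumes by
itself as soon as the blocked call returns.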
>>>> A. Let's cancel the stuck tasks.
>>>> If we move toward a libvirt connection pool, and we give each worker
>>>> thread in the sampling pool a separate libvirt connection, hopefully
>>>> read-only,
>>>
>>> Why read only?
>>
>> Application of the rule of least privilege.
>> Sampling doesn't need to change any VM/domain property, just to read
>> stuff.
>
> Good idea.
>
>>>> then we should be able to cancel stuck tasks by killing the worker's
>>>> libvirt connection. We'll still need a (probably much simpler)
>>>> watchman/supervisor, but no big deal here.
>>>> Libvirt allows closing a connection from a different thread.
>>>> I haven't actually tried to unstick a blocked thread this way, but I
>>>> have no reason to believe it will not work.
>>>
>>> Let's try?
>>>
>>> You can block access to storage using iptables, which may cause the
>>> block-related calls to get stuck, and try to close the connection after
>>> a few seconds from another thread.
>>
>> Sure, it is near the top of my TODO list.
>
> Let's wait with this. If we can continue to send other requests on the
> same connection while one call never returns, I don't see a need to
> close it.
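Fine. For the record, the experiment would be roughly this - an untested
sketch using the libvirt Python bindings; the URI, domain name and disk
path are placeholders:

    import threading
    import libvirt

    URI = 'qemu:///system'
    DOMAIN = 'testvm'       # a vm whose storage we blocked via iptables
    DISK = '/path/to/disk'  # a disk residing on the blocked storage


    def worker(conn):
        dom = conn.lookupByName(DOMAIN)
        try:
            # with storage unreachable, this may hang in the QEMU monitor
            print(dom.blockInfo(DISK, 0))
        except libvirt.libvirtError as e:
            print('call aborted: %s' % e)


    conn = libvirt.openReadOnly(URI)
    t = threading.Thread(target=worker, args=(conn,))
    t.start()
    t.join(10.0)            # give the call a few seconds to get stuck
    if t.is_alive():
        conn.close()        # close from another thread
        t.join()            # does the worker return with an error now?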
>>>> B. Let's keep around blocked threads
>>>> The code as it is just leaves a blocked libvirt call, and the worker
>>>> thread that carried it, frozen. The stuck worker thread can be
>>>> replaced, up to a cap of frozen threads. In this worst-case scenario
>>>> we end up with one (blocked!) thread per VM, as it is today, and with
>>>> no sampling data.
>>>
>>> In the worst case, each vm will cause all threads to be stuck on calls
>>> related to this vm, since calls can run on any thread in the pool.
>>
>> True, and this should be avoided anyway.
>>
>>>> On the other hand, by avoiding reissuing dangerous calls,
>>>
>>> Which are the dangerous calls?
>>>
>>> If they are related to storage domains, we already have a thread per
>>> domain, so maybe they should run on the domain monitoring thread, not
>>> on the libvirt threads.
>>>
>>> We don't have any issue if a domain monitoring thread gets stuck - this
>>> will simply make the domain unavailable after a couple of minutes.
>>
>> This is an interesting idea, see below for a description of the affected
>> calls.
>>
>>>> Thoughts and comments very welcome!
>>>
>>> We don't know yet why libvirt calls may get stuck, right?
>>>
>>> libvirt is probably not accessing storage (which may get you into D
>>> state) but queries qemu, which does access storage. So libvirt should
>>> be able to abort such calls.
>>>
>>> Let's make a list of calls that can get stuck, and check with the
>>> libvirt developers what is the best way to handle such calls.
>>
>> And here we go:
>>
>> sampler                  period (runs every X seconds)
>> self._highWrite           2
>>
>> self._sampleVmJobs       15
>> self._sampleBalloon      15
>> self._sampleCpu          15
>> self._sampleNet          15
>> self._sampleVcpuPinning  15
>> self._sampleCpuTune      15
>>
>> self._updateVolumes      60
>> self._sampleDisk         60
>> self._sampleDiskLatency  60
>>
>> The potentially blocking samplers are
>> _highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk
>>
>> * _highWrite uses virDomainGetBlockInfo which, depending on the storage
>> configuration, could enter the QEMU monitor.
>> libvirt needs to do some direct access (stat) anyway; then, depending on
>> the device:
>> - file-based devices: libvirt does the access directly
>
> We call virDomainGetBlockInfo only for block storage domains and qcow
> format:
>
> # vm.py:
>     def _getExtendCandidates(self):
>         ret = []
>
>         mergeCandidates = self._getLiveMergeExtendCandidates()
>         for drive in self._devices[DISK_DEVICES]:
>             if not drive.blockDev or drive.format != 'cow':
>                 continue
>
>             capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
>
>> - block-based devices: libvirt enters the QEMU monitor to ask for the
>> last block extent
>
> This is our only usage.
>
>> * _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be
>> safe.
>
> Is "should be safe" really safe? Can you check with the libvirt
> developers about that?
>
>> * _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev.
>> Should be safe.
>>
>> * _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs
>> to enter the domain monitor.
>
> And the domain monitor may be stuck when storage is not available?
>
>> * _sampleVmJobs uses virDomainGetBlockJobInfo, which needs to enter the
>> QEMU monitor. However, this needs to run only if there are active block
>> jobs. I don't have numbers, but I expect this sampler to be idle most of
>> the time.
>
> Adam: This is related to live merge, right?
>
>> * _sampleVcpuPinning uses virDomainGetVcpus, which uses /proc and
>> syscalls (for the affinity stuff); should be safe
>>
>> * _sampleCpuTune uses virDomainGetSchedulerParameters, which uses
>> cgroups, and virDomainGetVcpusFlags and virDomainGetMetadata, which both
>> access the domain definition (the XML inside libvirt); should be safe
>>
>> * _updateVolume uses irs.getVolumeSize, which I guess is safe
>>
>> * _sampleDisk uses virDomainBlockStats, which needs to enter the QEMU
>> monitor.
>>
>> * _sampleDiskLatency uses virDomainBlockStatsFlags, which needs to enter
>> the QEMU monitor. This call looks like a generalized version of the
>> above, so maybe I can optimize the two somehow by issuing a single call
>> to libvirt and gathering all the stats in one go.
>>
>> _highWrite is in a class of its own. It could be a candidate to be
>> moved, at least partially (storage health monitoring), into storage. The
>> rest of its functionality may stay in virt, or may be moved to storage
>> as well. Let's see.
>>
>> In general, we can distinguish the calls into:
>>
>> * storage/VM health (prevent a VM misbehaving when storage is
>> exhausted): _highWrite
>> * storage status: _updateVolume, _sampleDisk, _sampleDiskLatency.
>> We can optimize the latter two; it looks wasteful to ask for data,
>> discard some, then ask again for the missing data. Need to check
>> carefully if the API allows doing both in one go, but it looks feasible
>> (see the PS below for a sketch).
>> * balloon stats: _sampleBalloon: needs to enter the QEMU monitor
>> * everything else: does not touch disk nor qemu, so no special treatment
>> here.
>
> If we separate the storage calls so vm sampling continues even if storage
> causes qemu to block - do we gain anything? I guess this vm is not useful
> if its storage is not available, and it will soon pause anyway.
>
> Looks like the simplest solution is the best and fancy separation
> of sampling calls is needed.

you meant "not needed", no?

> Nir
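PS: about merging _sampleDisk and _sampleDiskLatency: if
virDomainBlockStatsFlags really returns a superset of what
virDomainBlockStats gives us, one call per disk could feed both samplers.
An untested sketch; the key names are the ones documented for
virDomainBlockStatsFlags, worth double-checking:

    def sample_disk_stats(dom, drive_name):
        # one libvirt call; returns a dict of typed counters
        stats = dom.blockStatsFlags(drive_name, 0)
        throughput = {
            'readBytes': stats.get('rd_bytes'),
            'writtenBytes': stats.get('wr_bytes'),
            'readOps': stats.get('rd_operations'),
            'writeOps': stats.get('wr_operations'),
        }
        latency = {
            'readLatency': stats.get('rd_total_times'),
            'writeLatency': stats.get('wr_total_times'),
            'flushLatency': stats.get('flush_total_times'),
        }
        return throughput, latency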