On 27/11/13 18:20, Daniel P. Berrange wrote:
> On Wed, Nov 27, 2013 at 06:10:47PM +0000, Edward Hope-Morley wrote:
>> On 27/11/13 17:43, Daniel P. Berrange wrote:
>>> On Wed, Nov 27, 2013 at 05:39:30PM +0000, Edward Hope-Morley wrote:
>>>> On 27/11/13 15:49, Daniel P. Berrange wrote:
>>>>> On Wed, Nov 27, 2013 at 02:45:22PM +0000, Edward Hope-Morley wrote:
>>>>>> Moving this to the ml as requested, would appreciate
>>>>>> comments/thoughts/feedback.
>>>>>>
>>>>>> So, I recently proposed a small patch to the oslo rpc code (initially in
>>>>>> oslo-incubator, then moved to oslo.messaging) which extends the existing
>>>>>> support for limiting the rpc thread pool so that concurrent requests can
>>>>>> be limited based on type/method. The blueprint and patch are here:
>>>>>>
>>>>>> https://blueprints.launchpad.net/oslo.messaging/+spec/rpc-concurrency-control
>>>>>>
>>>>>> The basic idea is that if you have a server with limited resources you
>>>>>> may want to restrict operations that would impact those resources, e.g.
>>>>>> live migrations on a specific hypervisor or volume formatting on a
>>>>>> particular volume node. This patch allows you, admittedly in a very
>>>>>> crude way, to apply a fixed limit to a set of rpc methods. I would like
>>>>>> to know whether or not people think this sort of thing would be useful,
>>>>>> or whether it alludes to a more fundamental issue that should be dealt
>>>>>> with in a different manner.
>>>>> Based on this description of the problem I have some observations:
>>>>>
>>>>> - I/O load from the guest OS itself is just as important to consider
>>>>>   as I/O load from management operations Nova does for a guest. Both
>>>>>   have the capability to impose denial-of-service on a host. IIUC, the
>>>>>   flavour specs have the ability to express resource constraints for
>>>>>   the virtual machines to prevent a guest OS initiated DOS attack.
>>>>>
>>>>> - I/O load from live migration is attributable to the running
>>>>>   virtual machine. As such I'd expect that any resource controls
>>>>>   associated with the guest (from the flavour specs) should be
>>>>>   applied to control the load from live migration.
>>>>>
>>>>>   Unfortunately life isn't quite this simple with KVM/libvirt
>>>>>   currently. For networking we've associated each virtual TAP
>>>>>   device with traffic shaping filters. For migration you have
>>>>>   to set a bandwidth cap explicitly via the API. For network
>>>>>   based storage backends, you don't directly control network
>>>>>   usage, but instead I/O operations/bytes. Ultimately though
>>>>>   there should be a way to enforce limits on anything KVM does;
>>>>>   similarly I expect other hypervisors can do the same.
>>>>>
>>>>> - I/O load from operations that Nova does on behalf of a guest
>>>>>   that may be running, or may be yet to be launched. These are not
>>>>>   directly known to the hypervisor, so existing resource limits
>>>>>   won't apply. Nova however should have some capability for
>>>>>   applying resource limits to I/O intensive things it does, and
>>>>>   somehow associate them with the flavour limits or some global
>>>>>   per-user cap perhaps.
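
For reference, the per-VM controls described above can already be set through
the libvirt python bindings, e.g. a migration bandwidth cap and block I/O
throttling on a disk. This is only an illustration (the domain name, device
and numbers below are made up), but it is the kind of per-domain limit I
understand you to mean:

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # hypothetical domain

    # Cap live migration bandwidth for this domain (MiB/s).
    dom.migrateSetMaxSpeed(100)

    # Throttle block I/O on one of its disks (bytes/s and IOPS).
    dom.setBlockIoTune('vda',
                       {'total_bytes_sec': 50 * 1024 * 1024,
                        'total_iops_sec': 500},
                       libvirt.VIR_DOMAIN_AFFECT_LIVE)

As you say though, these only cover load that is attributable to a running
domain, not the work Nova does on a guest's behalf.
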
>>>>>> Thoughts?
>>>>> Overall I think that trying to apply caps on the number of API calls
>>>>> that can be made is not really a credible way to avoid users inflicting
>>>>> a DOS attack on the host OS. Not least because it does nothing to
>>>>> control what a guest OS itself may do. If you do caps based on the
>>>>> number of API calls in a time period, you end up having to do an
>>>>> extremely pessimistic calculation - basically you have to consider the
>>>>> worst case for any single API call, even if most don't hit the worst
>>>>> case. This is going to hurt scalability of the system as a whole IMHO.
>>>>>
>>>>> Regards,
>>>>> Daniel
>>>> Daniel, thanks for this, these are all valid points and essentially tie
>>>> in with the fundamental issue of dealing with DOS attacks, but for this
>>>> bp I actually want to stay away from that area, i.e. this is not intended
>>>> to solve any tenant-based attack issues in the rpc layer (although that
>>>> definitely warrants a discussion, e.g. how do we stop a single tenant
>>>> from consuming the entire thread pool with requests), but rather I'm
>>>> thinking more from a QOS perspective, i.e. to allow an admin to account
>>>> for a resource bias, e.g. a slow RAID controller, on a given node (not
>>>> necessarily Nova/HV) which could be alleviated with this sort of crude
>>>> rate limiting. Of course one problem with this approach is that
>>>> blocked/limited requests still reside in the same pool as other requests,
>>>> so if we did want to use this it may be worth considering offloading
>>>> blocked requests or giving them their own pool altogether.
>>>>
>>>> ...or maybe this is just pie in the sky after all.
>>> I don't think it is valid to ignore tenant-based attacks in this. You
>>> have a single resource here and it can be consumed by the tenant
>>> OS, by the VM associated with the tenant, or by Nova itself. As such,
>>> IMHO adding rate limiting to Nova APIs alone is a non-solution because
>>> you've still left it wide open to starvation by any number of other
>>> routes which are arguably even more critical to address than the API
>>> calls.
>>>
>>> Daniel
>> Daniel, maybe I have misunderstood you here, but with this optional
>> extension I am (a) not intending to solve DOS issues and (b) not
>> "ignoring" DOS issues, since I do not expect to be adding any beyond or
>> accentuating those that already exist. The issue here is QOS, not DOS.
> I consider QOS & DOS to be two sides of the same coin here. A denial of
> service is anything which affects the quality of service of the host.
> It doesn't have to be done with malicious intent either. I don't think
> your proposal provides significant QOS benefits except under some very
> narrowly constrained scenario, which I'm yet to be convinced is very
> applicable to the bigger picture / real world deployment scenario.
>
> Daniel

Daniel,
Let's flip this coin for a second and say I have a cinder-volume node that is
using LVM with disks that are comparatively slow next to those on other nodes,
and I therefore want to avoid too many concurrent volume delete requests
(which will zero out the whole volume) starving disk IO from tenant
reads/writes on those disks. Do we have a way to prevent this currently? If
not, then I think this extension may come in handy.
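
For concreteness, the shape of what I am proposing is roughly the following.
This is only a sketch, not the actual patch: the method names, limits and the
dispatch hook are illustrative, the limits would really come from
configuration as described in the blueprint, and I am assuming the
eventlet-based worker pool the oslo rpc code uses.

    from eventlet.semaphore import Semaphore

    # Hypothetical per-method concurrency limits, e.g. cap the number of
    # simultaneous volume deletes (each of which zeroes the whole volume)
    # on a node with slow disks.
    method_limits = {'delete_volume': 2}

    _semaphores = dict((name, Semaphore(limit))
                       for name, limit in method_limits.items())

    def dispatch(method_name, method, *args, **kwargs):
        """Invoke an rpc endpoint method, capping per-method concurrency."""
        sem = _semaphores.get(method_name)
        if sem is None:
            # No limit configured for this method; run as normal.
            return method(*args, **kwargs)
        # Blocks the worker (green)thread while the limit is reached, which
        # is exactly the caveat I mentioned above about blocked requests
        # still occupying the shared pool.
        with sem:
            return method(*args, **kwargs)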
