On Mon, Jun 15, 2015 at 09:01:53AM +0200, Christian Borntraeger wrote:
> On 13.06.2015 at 22:10, Michael S. Tsirkin wrote:
> > On Fri, Jun 12, 2015 at 01:56:37PM +0200, Christian Borntraeger wrote:
> >> On 10.06.2015 at 15:13, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote:
> >>>> On 09/06/15 13:37, Christian Borntraeger wrote:
> >>>>> On 09.06.2015 at 12:19, Denis V. Lunev wrote:
> >>>>>> Excessive virtio_balloon inflation can cause invocation of the
> >>>>>> OOM-killer when Linux is under severe memory pressure. Various
> >>>>>> mechanisms are responsible for correct virtio_balloon memory
> >>>>>> management. Nevertheless it is often the case that these control
> >>>>>> tools do not have enough time to react to a fast changing memory
> >>>>>> load. As a result the OS runs out of memory and invokes the
> >>>>>> OOM-killer. The balancing of memory by use of the virtio balloon
> >>>>>> should not cause the termination of processes while there are
> >>>>>> pages in the balloon. Currently there is no way for the virtio
> >>>>>> balloon driver to free memory at the last moment before some
> >>>>>> process gets killed by the OOM-killer.
> >>>>>>
> >>>>>> This does not introduce a security breach as the balloon itself is
> >>>>>> running inside the Guest OS and is working in cooperation with the
> >>>>>> host. Thus some improvements from the Guest side should be
> >>>>>> considered as normal.
> >>>>>>
> >>>>>> To solve the problem, introduce a virtio_balloon callback which is
> >>>>>> expected to be called from the oom notifier call chain in the
> >>>>>> out_of_memory() function. If virtio balloon could release some
> >>>>>> memory, it will make the system return and retry the allocation
> >>>>>> that forced the out of memory killer to run.
> >>>>>>
> >>>>>> This behavior should be enabled if and only if the appropriate
> >>>>>> feature bit is set on the device. It is off by default.
> >>>>> The balloon frees pages in this way
> >>>>>
> >>>>> static void balloon_page(void *addr, int deflate)
> >>>>> {
> >>>>> #if defined(__linux__)
> >>>>>     if (!kvm_enabled() || kvm_has_sync_mmu())
> >>>>>         qemu_madvise(addr, TARGET_PAGE_SIZE,
> >>>>>                      deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> >>>>> #endif
> >>>>> }
> >>>>>
> >>>>> The guest can re-touch that page and get an empty zero page or the
> >>>>> old page back without tampering with host integrity. This should
> >>>>> work for all cases I am aware of (without sync_mmu it's a nop
> >>>>> anyway) so why not enable that by default? Anything that I missed?
> >>>>>
> >>>>> Christian
> >>>>
> >>>> I'd like to do that :) Actually the original version of the kernel
> >>>> patch enabled this unconditionally. But Michael asked to make
> >>>> it configurable and off by default.
> >>>>
> >>>> Den
> >>>
> >>> That's not the question here. The question is why is it limited by
> >>> kvm_has_sync_mmu.
> >>
> >> Well we have two interesting options here:
> >>
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST and VIRTIO_BALLOON_F_DEFLATE_ON_OOM
> >>
> >> For any sane host with ondemand paging just re-accessing the page
> >> should simply work. So the common case could be
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST == off
> >
> > Disabling this breaks useful optimizations such as
> > ability not to migrate memory in the balloon.
>
> memory in the balloon is usually backed by the empty zero page after
> the madvise (WONT_NEED will finally result in zap_pte_range for the
> common case). In an ideal world migration should be able to optimize
> zero pages.

This still involves reading them in as opposed to just skipping them.

>
> >> VIRTIO_BALLOON_F_DEFLATE_ON_OOM == on
> >
> > AFAIK management tools depend on balloon not deflating
> > below host-specified threshold to avoid OOM on the host.
> > So I don't think we can make this a default,
> > management needs to enable this explicitly.
>
> If the ballooning is required to keep the host memory management
> from OOM - iow abusing ballooning as memory hotplug between guests
> then yes better let the guest oom - that makes sense.
>
> Now: I think that doing so (not having enough swap in the host if
> all guests deflate) and relying on balloon semantics is fundamentally
> broken. Let me explain this: The problem is that we rely on guest
> cooperation for the host integrity. As I explained using madvise
> WONT_NEED will replace the current PTEs with invalid/empty PTEs. As
> soon as the guest kernel re-touches the page (e.g. a malicious
> kernel module - not the balloon driver) it will be backed by the VMA's
> default method - so usually with a shared R/O copy of the empty
> zero page. Write accesses will result in a copy-on-write and allocate
> new memory in the host.
> There is nothing we can do in the balloon protocol to protect the host
> against malicious guests allocating all the maximum memory.

If we want to try and harden the host, we can unmap the ballooned
pages so the guest will crash if it touches them without deflating
first.

> If you need host integrity against guest memory usage, something like
> cgroups_memory or so is probably the only reliable way.

In the original design, protection against a malicious guest is not
the point of the balloon; it's a technology that lets you overcommit
cooperative guests.

>
> >
> >> Only for the rare case of hypervisors without paging or other memory
> >> related restrictions we have to enable MUST_TELL_HOST.
> >> Now: QEMU knows exactly which case we have, so why not let QEMU tell
> >> the guest what the capabilities are. (e.g.
> >> sync_mmu ---> no need to tell the host).
> >>
> >> I can at least imagine that some admin wants to make the oom case
> >> configurable, but a sane default seems to be to not kill random
> >> guest processes.
> >>
> >> Christian
> >