* Markus Armbruster (arm...@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> writes:
>
> > * Markus Armbruster (arm...@redhat.com) wrote:
> >> "Dr. David Alan Gilbert" <dgilb...@redhat.com> writes:
> >>
> >> > * Markus Armbruster (arm...@redhat.com) wrote:
> >> >> "Dr. David Alan Gilbert" <dgilb...@redhat.com> writes:
> >> >>
> >> >> > * Markus Armbruster (arm...@redhat.com) wrote:
> >> >> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> >> >> >> g_try_new() & friends, which can fail, and therefore require error
> >> >> >> handling.
> >> >> >>
> >> >> >> HACKING points out the difference, but is mum on when to use what:
> >> >> >>
> >> >> >>     3. Low level memory management
> >> >> >>
> >> >> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
> >> >> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
> >> >> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
> >> >> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/
> >> >> >>     qemu_vfree APIs.
> >> >> >>
> >> >> >>     Please note that g_malloc will exit on allocation failure, so there
> >> >> >>     is no need to test for failure (as you would have to with malloc).
> >> >> >>     Calling g_malloc with a zero size is valid and will return NULL.
> >> >> >>
> >> >> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the
> >> >> >>     following reasons:
> >> >> >>
> >> >> >>       a. It catches multiplication overflowing size_t;
> >> >> >>       b. It returns T * instead of void *, letting compiler catch more
> >> >> >>          type errors.
> >> >> >>
> >> >> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> >> >> >>
> >> >> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed
> >> >> >>     with qemu_vfree, since breaking this will cause problems on Win32.
> >> >> >>
> >> >> >> Now, in my personal opinion, handling OOM gracefully is worth the
> >> >> >> (commonly considerable) trouble when you're coding for an Apple II or
> >> >> >> similar.  Anything that pages commonly becomes unusable long before
> >> >> >> allocations fail.
> >> >> >
> >> >> > That's not always my experience; I've seen cases where you suddenly
> >> >> > allocate a load more memory and hit OOM fairly quickly on that hot
> >> >> > process.  Most of the time on the desktop you're right.
> >> >> >
> >> >> >> Anything that overcommits will send you a (commonly
> >> >> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
> >> >> >> and manages to dodge both these bullets somehow, will commonly get it
> >> >> >> wrong and crash.
> >> >> >
> >> >> > If your qemu has mapped its main memory from hugetlbfs or similar
> >> >> > pools, then we're looking at the other memory allocations; and that's
> >> >> > a bit of an interesting difference, where those other allocations
> >> >> > should be a lot smaller.
> >> >> >
> >> >> >> But others are entitled to their opinions as much as I am.  I just
> >> >> >> want to know what our rules are, preferably in the form of a patch to
> >> >> >> HACKING.
> >> >> >
> >> >> > My rule is to try not to break a happily running VM by some new
> >> >> > activity; I don't worry about it during startup.
> >> >> >
> >> >> > So for example, I don't like it when starting a migration allocates
> >> >> > some more memory and kills the VM - the user had a happy stable VM
> >> >> > up to that point.  Migration gets the blame at this point.
> >> >>
> >> >> I don't doubt reliable OOM handling would be nice.  I do doubt it's
> >> >> practical for an application like QEMU.
> >> >
> >> > Well, our use of glib certainly makes it much much harder.
> >> > I just try and make sure anywhere that I'm allocating a non-trivial
> >> > amount of memory (especially anything guest or user controlled) uses
> >> > the _try_ variants.  That should catch a lot of the larger allocations.
> >>
> >> Matters only when your g_try_new()s actually fail (which they won't, at
> >> least not reliably), and your error paths actually work (which they
> >> won't unless you test them, no offense).
> >>
> >> > However, it scares me that we've got things that can return big chunks
> >> > of JSON for example, and I don't think they're being careful about it.
> >>
> >> We've got countless allocations small and large (large as in gigabytes)
> >> that kill QEMU on OOM.  Some of the small allocations add up to
> >> megabytes (QObjects for JSON work, for example).
> >>
> >> Yet the *practical* problem isn't lack of graceful handling when these
> >> allocations fail.  Because they pretty much don't.
> >>
> >> The practical problem I see is general confusion on what to do about
> >> OOM.  There's no written guidance.  Vague rules of thumb on when to
> >> handle OOM are floating around.  Code gets copied.  Unsurprisingly, OOM
> >> handling is a haphazard affair.
> >>
> >> In this state, whatever OOM handling we have is too unreliable to be
> >> worth much, since it can only help when (1) allocations actually fail
> >> (they generally don't), and (2) the allocation that fails is actually
> >> handled (they generally aren't), and (3) the handling actually works (we
> >> don't test OOM, so it generally doesn't).
> >>
> >> For the sake of the argument, let's assume there's a practical way to
> >> run QEMU so that memory allocations actually fail.  We then still need
> >> to find a way to increase the probability for failed allocations to be
> >> actually handled, and the probability for the error handling to actually
> >> work, both to a useful level.
> >> This will require rules on OOM handling,
> >> a strategy to make them stick, a strategy to test OOM, and resources to
> >> implement all that.
> >
> > There's probably no way to guarantee we've got all paths; however, we
> > can test in restricted memory environments.
> > For example, we could set up a test environment that runs a series of
> > hotplug or migration tests (say avocado or something) in cgroups
> > or nested VMs with randomly reduced amounts of RAM.  These will blow up
> > spectacularly and we can slowly attack some of the more common paths.
>
> There's also fault injection.  It's more targeted.  Bonus: it lets you
> make only the allocations you deem likely to fail actually fail, i.e.
> keep the unchecked ones working ;-P
>
> > If we can find common cases then perhaps we can identify things to use
> > static checkers for.
> >
> > We can also try setting up tests in environments closer to the way
> > OpenStack and oVirt configure their hosts; they seem to jump through
> > hoops to get a feeling for how much spare memory to allocate, but of
> > course since we don't define how much we use, they can't really do that.
> >
> > Using mlock would probably make the allocations more likely to
> > fail rather than fault later?
>
> "More likely" as in "at all likely".  Without a solution here, all the
> other work is on unreachable code.  My box lets me allocate gigabytes of
> memory I don't have.  In case you find my example involving -device qxl
> too opaque, I append a test program.  It successfully allocates one
> terabyte in 1024 gigabyte chunks for me.  It behaves exactly the same
> with g_malloc() instead of malloc().
>
> The "normal" way to disable memory overcommit is
> /proc/sys/vm/overcommit_memory, but it's system-wide, and requires root.
> That's a big hammer.  A more precise tool could be more useful.  To
> actually matter, we need a tool that libvirt can apply to production
> VMs.
>
> >> Will the benefits be worth the effort?
> >> Arguing about that in the
> >> near-total vacuum we have now is unlikely to be productive.  To ground
> >> the debate at least somewhat, I'd like those of us in favour of OOM
> >> handling to propose a first draft of OOM handling rules.
> >
> > Well, I'm up to give it a go; but before I do, can you define a bit more
> > what you want?  Firstly, what do you define as 'OOM handling', and
> > secondly, what level of rules do you want?
>
> 0. OOM is failure to allocate a chunk of memory.
>
> 1. When is it okay to terminate the process on OOM?
>
>    Write down rules that let people decide whether a given allocation
>    needs to be handled gracefully.
>
>    If your rules involve small vs. large allocations, then make sure to
>    define "small".
>
>    If your rules involve "in response to untrusted input", spell that
>    out.
>
>    If your rules involve "in response to trusted input (think QMP)",
>    spell that out.
>
>    Also: exit() or abort()?  If I remember correctly, GLib aborts.
>
> 2. How to handle OOM gracefully [skip for first draft]
>
>    The usual: revert the function's side effects, return failure to the
>    caller, repeat for each caller until reaching the caller that consumes
>    the error.
>
> 3. Coding conventions [definitely skip for first draft]
>
>    This is part of the "strategy to make the rules stick".
OK, how about this as a strawman:
  https://wiki.qemu.org/Features/AllocationFailures

Dave

> >> If we can't do even that, I'll be tempted to shoot down OOM handling in
> >> patches to code I maintain.
> >
> > Please please don't do that; getting it right in the monitor path and
> > QMP is important for those cases where we generate big chunks of JSON
> > (it would be better if we didn't generate big chunks of JSON, but that's
> > a partially separate problem).
>
> As long as allocations don't fail, this is all mental masturbation
> (pardon my French).
>
>
> #include <stdlib.h>
> #include <stdio.h>
>
> int
> main(void)
> {
>     size_t GiB = 1024 * 1024 * 1024;
>     int i;
>     void *p;
>
>     for (i = 0; i < 1024; i++) {
>         printf("%d\n", i);
>         p = malloc(GiB);
>         if (!p) {
>             printf("OOM\n");
>             break;
>         }
>     }
>     return 0;
> }

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK