http://builder.berrange.com/personal/diary

libvirt User Mode Linux driver and other new features

It has been a while since I reported on libvirt development news, but that doesn't mean we've been idle. The big news is the introduction of another new hypervisor driver in libvirt, this time for User Mode Linux. While Xen / KVM get all the press these days, UML has been quietly providing virtualization for Linux users for many years - until very recently nearly all Linux virtual server providers were deploying User Mode Linux guests. libvirt aims to be the universal management API for all virtualization technologies, and UML has no formal API of its own, so it is only natural that we provide a UML driver in libvirt. It is still at a fairly basic level of functionality, only supporting disks & paravirt consoles, but it is enough to get a guest booted & interact with it locally. The next step is adding networking support, at which point it'll be genuinely useful. To recap, libvirt now has drivers for Xen, QEMU, KVM, OpenVZ, LXC (LinuX native Containers) and UML, as well as a test driver & RPC support.

In other news, a couple of developers at VirtualIron have recently contributed some major new features to libvirt. The first set of APIs provides the ability to register for lifecycle events against domains, allowing an application to be notified whenever a domain stops, starts, migrates, etc, rather than having to continually poll for status changes. This is implemented for KVM and Xen so far. The second, huge, set of APIs provides a way to query a host for details of all the hardware devices it has. This is a key building block to allow remote management tools to assign PCI/USB devices directly to guest VMs, and to more intelligently configure networking and storage. Think of it as a remotely accessible version of HAL. In fact, we use HAL as one of the backend implementations for the API, or, as an alternative, the new DeviceKit service.
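To give a feel for what the new device APIs look like from an application's point of view, here is a rough sketch of listing a host's devices. It is just an illustration, not code from any real tool - the qemu:///system URI and the absence of real error reporting are simplifications - but the virNodeNumOfDevices / virNodeListDevices calls are the entry points described above.

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    /* Read-only connection to the local QEMU/KVM driver; any libvirt
     * URI whose driver implements the node device APIs would do */
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    if (conn == NULL) {
        fprintf(stderr, "unable to connect to the hypervisor\n");
        return 1;
    }

    /* A NULL capability name means "all devices"; a capability such as
     * "pci" or "net" can be passed instead to filter the listing */
    int ndevs = virNodeNumOfDevices(conn, NULL, 0);
    if (ndevs > 0) {
        char **names = calloc(ndevs, sizeof(*names));
        if (names != NULL) {
            ndevs = virNodeListDevices(conn, NULL, names, ndevs, 0);
            for (int i = 0; i < ndevs; i++) {
                /* Each name can be fed to virNodeDeviceLookupByName() and
                 * virNodeDeviceGetXMLDesc() for the full device details */
                printf("%s\n", names[i]);
                free(names[i]);
            }
            free(names);
        }
    }

    virConnectClose(conn);
    return 0;
}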
Friday, July 25, 2008
kernel-xen is dead. Long live kernel + paravirt_ops

In Fedora 9 we discontinued our long standing forward-port of Xen's 2.6.18 kernel tree, switched to a generic LKML tree (which already had i386 Xen DomU pv_ops), and added a set of patches to support x86_64 Xen DomU pv_ops. While it lacks functionality compared to the previous Xen kernels, and was certainly less stable for a while, overall this was a great success in terms of maintainability. It was still a separate kernel RPM though... Jeremy Fitzhardinge has meanwhile continued to improve the stability & functionality of the Xen i386 pv_ops tree in upstream LKML, and has also taken the hacky Fedora x86_64 pv_ops patches, substantially cleaned them up & worked them into a form that was acceptable for upstream. A couple of days ago Ingo sent Jeremy's work on to Linus, who promptly merged it for 2.6.27. Fedora 10 Rawhide is of course tracking 2.6.27, so yesterday Mark McLoughlin turned on Xen pv_ops in the main kernel RPM and killed off 'kernel-xen'. So for Fedora 10 we'll have one kernel RPM to rule them all. By the magic of pv_ops, it auto-detects whether it's running on bare metal, Xen, VMWare (VMI) or KVM at boot and optimizes itself for each platform! There's only one small wrinkle, and it isn't really Xen's fault. The Anaconda install images on 32-bit Fedora use an i586, non-PAE kernel. Xen 32-bit is i686, PAE only, so we still need to have a separate initrd and vmlinux for installation - but at least it is derived from the general purpose 'kernel-PAE' binary, instead of 'kernel-xen'.

Of course 64-bit doesn't have this complication. Someone just needs to fix 32-bit Linux so it can auto-switch between non-PAE and PAE at runtime. It was always said to be impossible to unify UP & SMP kernels... until someone actually did it. Now we just need someone to do the impossible for PAE and all will be right with the world :-) It has taken a long time & a lot of work by many, many people to get Xen's DomU kernel bits merged upstream, so congratulations to all involved on getting another architecture merged, enabling us to finally take full advantage of paravirt_ops in Fedora's Xen kernels.

Thursday, June 26, 2008
New Java bindings for libvirt

DV has recently been looking at the issue of Java bindings for libvirt. A few months back a libvirt community member, Tóth István, contributed most of the code for Java bindings to libvirt. Daniel has now taken this codebase, added a build system, is hosting it in the libvirt CVS repository, and has done a formal release. This should be hitting Fedora 10 rawhide in the near future, meaning we now have bindings for C, Perl, Python, OCaml, Ruby and Java. Now who wants to do a PHP binding... that's the only other language commonly requested.

Wednesday, June 18, 2008
Red Hat Summit 2008

Just finished my talk at the Red Hat Summit on libvirt and virtualization tools. For those who are interested, I've now posted the slides online.

Friday, May 23, 2008
Better living through API design: low level memory allocation

The libvirt library provides a core C API upon which language specific bindings are also built. Being written in C of course means we have to do our own memory management / allocation. Historically we've primarily used the standard malloc/realloc/calloc/free APIs that everyone knows and loves/hates, although we do have an internal variable length string API (more on that in a future post). Of course there are many other higher level memory management APIs (obstacks/memory pools, garbage collectors, etc) you can write C code with too, but in common with the majority of other apps/code we happen to be using the general purpose malloc/free pattern everywhere. I've recently been interested in improving our code quality, reliability and testing, and found myself thinking about whether there was a way to make our use of these APIs less error prone, and verifiably correct at build time. If you consider the standard C library malloc/realloc/calloc/free APIs, they in fact positively encourage application coding errors. Off the top of my head there are at least 7 common problems, probably more....
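To give a flavour of the kind of problems meant, the invented snippet below (nothing to do with real libvirt code) packs a few of the classics into one function:

#include <stdlib.h>
#include <string.h>

struct item { char name[64]; };

void pitfalls(void)
{
    /* Unchecked return value: an out of memory condition becomes a
     * NULL pointer dereference on the very next line */
    struct item *first = malloc(sizeof(struct item));
    strcpy(first->name, "boom on OOM");

    /* Wrong size: sizeof applied to the pointer rather than the thing
     * it points to, silently under-allocating the array */
    struct item *list = malloc(10 * sizeof(list));

    /* Lost pointer: if realloc() fails it returns NULL, and assigning
     * that straight back to 'list' leaks the original, still valid block */
    list = realloc(list, 20 * sizeof(*list));

    free(first);
    free(list);

    /* Double free: 'list' still points at freed memory, so this second
     * call corrupts the heap */
    free(list);
}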
A great many libraries will create wrapper functions around malloc/free/realloc, but they generally only attempt to address one or two of these points. As an example, consider GLib, which has a wrapper around malloc() attempting to address point 2. It does this by making it call abort() on failure to allocate, but then re-introduces the risk by also adding a wrapper which doesn't abort():

gpointer g_malloc     (gulong n_bytes) G_GNUC_MALLOC;
gpointer g_try_malloc (gulong n_bytes) G_GNUC_MALLOC;

It also wraps realloc() for the same reason, and adds an annotation to make the compiler warn if you don't use the return value:

gpointer g_realloc     (gpointer mem, gulong n_bytes) G_GNUC_WARN_UNUSED_RESULT;
gpointer g_try_realloc (gpointer mem, gulong n_bytes) G_GNUC_WARN_UNUSED_RESULT;
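In practice the annotation catches the realloc() pattern like so (a trivial sketch, not taken from any real codebase):

#include <glib.h>

void grow_buffer(void)
{
    gchar *buf = g_malloc(64);

    /* Correct: the (possibly moved) block is assigned back to 'buf'.
     * Writing just "g_realloc(buf, 128);" would earn a compiler warning
     * thanks to G_GNUC_WARN_UNUSED_RESULT */
    buf = g_realloc(buf, 128);

    /* But nothing forces a NULL check on the _try_ variants */
    gchar *maybe = g_try_malloc(1024);
    g_free(maybe);

    g_free(buf);
}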
This at least addresses point 6, ensuring that you update your variable with the new pointer, but it can't protect against a failure to check for NULL. And the free() wrapper doesn't address the double-free issue at all. You can debate whether "checking for NULL" or "calling abort()" is the better approach - Havoc has some compelling points - but that's not the point of this blog posting. I was interested in whether I could create wrappers for libvirt which kept the choice in the hands of the caller, while still protecting against the common risks.
Having considered all this, it's possible to define a set of criteria for the design of a low level memory allocation API that is considerably safer than the standard C one, while still retaining nearly all of its flexibility and avoiding the imposition of policy such as calling abort() on failure.
So the primary application usage would be via a set of macros:

VIR_ALLOC(ptr)
VIR_ALLOC_N(ptr, count)
VIR_REALLOC_N(ptr, count)
VIR_FREE(ptr)
These call through to the underlying APIs:

int virAlloc(void *ptrptr, size_t bytes)
int virAllocN(void *ptrptr, size_t bytes, size_t count)
int virReallocN(void *ptrptr, size_t bytes, size_t count)
int virFree(void *ptrptr)
The only annoying thing here is that although we're passing a pointer to a pointer into all of these, the first param is still just 'void *' and not 'void **'. This works because 'void *' is defined to be able to hold any type of pointer, and in addition using 'void **' would cause the compiler to complain bitterly about strict aliasing violations. Internally the implementations of these functions can still safely cast to 'void **' when dereferencing the pointer.
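The function bodies aren't shown here, but to make the casting trick concrete, a sketch of how they might look (not necessarily byte-for-byte what went into libvirt; a production version would also want to guard the bytes * count multiplication against integer overflow) is:

#include <stdlib.h>

int virAlloc(void *ptrptr, size_t bytes)
{
    /* Zero-fill the new block and store it through the caller's pointer */
    *(void **)ptrptr = calloc(1, bytes);
    return *(void **)ptrptr == NULL ? -1 : 0;
}

int virAllocN(void *ptrptr, size_t bytes, size_t count)
{
    *(void **)ptrptr = calloc(count, bytes);
    return *(void **)ptrptr == NULL ? -1 : 0;
}

int virReallocN(void *ptrptr, size_t bytes, size_t count)
{
    void *tmp = realloc(*(void **)ptrptr, bytes * count);
    if (tmp == NULL && bytes * count != 0)
        return -1;
    /* Only overwrite the caller's pointer on success, so the original
     * block is never lost if the allocation fails */
    *(void **)ptrptr = tmp;
    return 0;
}

int virFree(void *ptrptr)
{
    /* Free the block and NULL out the caller's pointer, turning an
     * accidental double free into a harmless no-op */
    free(*(void **)ptrptr);
    *(void **)ptrptr = NULL;
    return 0;
}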
All 3 of the Alloc/Realloc functions return an int status - 0 on success, -1 on failure - and carry a warn-unused-result style annotation so the compiler complains if a caller forgets to check the outcome. And finally, to wire up the macros to the APIs:

#define VIR_ALLOC(ptr) virAlloc(&(ptr), sizeof(*(ptr)))
#define VIR_ALLOC_N(ptr, count) virAllocN(&(ptr), sizeof(*(ptr)), (count))
#define VIR_REALLOC_N(ptr, count) virReallocN(&(ptr), sizeof(*(ptr)), (count))
#define VIR_FREE(ptr) virFree(&(ptr))

If this is all sounding fairly abstract, an illustration of usage should clear things up - in the original discussion the before and after snippets were taken straight from libvirt code.
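The snippet below uses an invented demoRecord structure rather than code lifted straight from libvirt, and assumes the macros above are in scope, but it captures the shift - particularly in the realloc() pattern:

#include <stdlib.h>
#include <string.h>

struct demoRecord { int id; char name[64]; };   /* invented for the example */

/* Before: the element size is spelled out by hand, and a temporary is
 * needed to avoid losing the array if the resize fails */
static int appendRecordBefore(struct demoRecord **recs, size_t *count)
{
    struct demoRecord *tmp;
    tmp = realloc(*recs, sizeof(struct demoRecord) * (*count + 1));
    if (tmp == NULL)
        return -1;
    *recs = tmp;
    memset(&(*recs)[*count], 0, sizeof((*recs)[*count]));
    (*count)++;
    return 0;
}

/* After: the macro derives the element size from the variable itself,
 * and leaves the array untouched if the allocation fails */
static int appendRecordAfter(struct demoRecord **recs, size_t *count)
{
    if (VIR_REALLOC_N(*recs, *count + 1) < 0)
        return -1;
    memset(&(*recs)[*count], 0, sizeof((*recs)[*count]));
    (*count)++;
    return 0;
}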
As these short examples show, the number of lines of code hasn't changed much, but the clarity of them has - particularly the realloc() usage - and of course there is now compile time verification of usage. The main problem remaining is the classic memory leak, from forgetting to call free() at all. If you want to use a low level memory allocation API this problem is essentially unavoidable. Fixing it really requires a completely different type of API (eg, obstacks/pools) or a garbage collector. And there's always valgrind to identify leaks, which works very nicely, particularly if you have extensive test suite coverage. The original mailing thread from which this post is derived is on the libvirt mailing list.
