Hi everyone,

Over the past few weeks I have been testing the effects of reducing the size of the entries in the qcow2 L2 cache. This was briefly mentioned by Denis in the same thread where we discussed subcluster allocation back in April, but I'll describe the problem and the proposal here in detail.
=== Problem ===

In the qcow2 file format guest addresses are mapped to host addresses using the so-called L1 and L2 tables. The size of an L2 table is the same as the cluster size, so a larger cluster means more L2 entries per table, and therefore an L2 table can map a larger portion of the address space (not only because it contains more entries, but also because each of those entries points at a larger data cluster).

There are two consequences of this:

1) If you double the cluster size of a qcow2 image, the maximum space
   needed for all L2 tables is divided by two (i.e. you need half the
   metadata).

2) If you double the cluster size of a qcow2 image, each L2 table maps
   four times as much disk space.

With the default cluster size of 64 KB, each L2 table maps 512 MB of contiguous disk space. This table shows what happens when you change the cluster size:

|--------------+------------------|
| Cluster size | An L2 table maps |
|--------------+------------------|
|        512 B |            32 KB |
|         1 KB |           128 KB |
|         2 KB |           512 KB |
|         4 KB |             2 MB |
|         8 KB |             8 MB |
|        16 KB |            32 MB |
|        32 KB |           128 MB |
|        64 KB |           512 MB |
|       128 KB |             2 GB |
|       256 KB |             8 GB |
|       512 KB |            32 GB |
|         1 MB |           128 GB |
|         2 MB |           512 GB |
|--------------+------------------|

When QEMU wants to convert a guest address into a host address, it needs to read the entry from the corresponding L2 table. The qcow2 driver doesn't read those entries directly; it loads the tables into the L2 cache so they can be kept in memory in case they are needed later.

The problem here is that the L2 cache (and the qcow2 driver in general) always works with complete L2 tables: if QEMU needs a particular L2 entry then the whole cluster containing the L2 table is read from disk, and if the cache is full then a cluster's worth of cached data has to be discarded.
The consequences of this are worse the larger the cluster size is: not only are we reading (and discarding) larger amounts of data, we are also using that memory very inefficiently.

Example: with 1 MB clusters each L2 table maps 128 GB of contiguous virtual disk, so that's the granularity of our cache. If we're performing I/O in a 4 GB area that overlaps two of those 128 GB chunks, we need to keep two complete L2 tables (2 MB) in the cache even though in practice we're only using 32 KB of those 2 MB (32 KB contain enough L2 entries to map the 4 GB that we're using).

=== The proposal ===

One way to solve the problems described above is to decouple the L2 table size (which is equal to the cluster size) from the cache entry size. The qcow2 cache doesn't actually know anything about the data that it's loading: it just receives a disk offset and checks that it is properly aligned. It's perfectly possible to make it load data blocks smaller than a cluster.

I already have a working prototype, and I have been doing tests with a 4 KB cache entry size. 4 KB is small enough to allow a more flexible use of the cache, it's also a common file system block size, and it can hold enough L2 entries to cover substantial amounts of disk space (especially with large clusters).
|--------------+-----------------------|
| Cluster size | 4KB of L2 entries map |
|--------------+-----------------------|
|        64 KB |                 32 MB |
|       128 KB |                 64 MB |
|       256 KB |                128 MB |
|       512 KB |                256 MB |
|         1 MB |                512 MB |
|         2 MB |                  1 GB |
|--------------+-----------------------|

Some results from my tests (using an SSD drive and random 4K reads; the figure in brackets is the amount of disk that the L2 cache can cover):

|-----------+--------------+-------------+---------------+--------------|
| Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
|-----------+--------------+-------------+---------------+--------------|
|     16 GB |        64 KB | 1 MB [8 GB] |     5000 IOPS |   12700 IOPS |
|      2 TB |         2 MB | 4 MB [1 TB] |      576 IOPS |   11000 IOPS |
|-----------+--------------+-------------+---------------+--------------|

The improvements are clearly visible, but it's important to point out a couple of things:

- The L2 cache size is always < the total L2 metadata on disk
  (otherwise this wouldn't make sense). Increasing the L2 cache size
  improves performance a lot (and makes the effect of these patches
  disappear), but it requires more RAM.

- Doing random reads over the whole disk is probably not a very
  realistic scenario. During normal usage only certain areas of the
  disk need to be accessed, so performance should be much better with
  the same amount of cache.

- I wrote a best-case scenario test (several I/O jobs, each accessing a
  part of the disk that requires loading its own L2 table) and my
  patched version is 20x faster even with 64 KB clusters.

=== What needs to change? ===

Not so much, fortunately:

- The on-disk format does not need any change; qcow2 images remain the
  same.

- The qcow2 cache driver needs almost no changes: the entry size is no
  longer assumed to be equal to the cluster size and has to be set
  explicitly. Other than that it remains the same (it can even be
  simplified).

The QEMU qcow2 driver does need a few changes:

- qcow2_get_cluster_offset() and get_cluster_table() simply need to be
  aware that they're not loading full L2 tables anymore.
- handle_copied(), handle_alloc(), discard_single_l2() and
  zero_single_l2() only need to update the calculation of nb_clusters.

- l2_allocate() and qcow2_update_snapshot_refcount() cannot load a full
  L2 table into memory at once, so they need to loop over the
  sub-tables. Other than wrapping the core of those functions inside a
  loop I haven't detected any major problems.

- We need a proper name for these sub-tables that we are loading now.
  I'm actually still struggling with this :-) I can't think of any name
  that is clear enough and not too cumbersome to use ("L2 subtables"?
  => confusing; "L3 tables"? => they're not really that).

=== What about the refcount cache? ===

This proposal would not touch it at all; the assumption would remain that "refcount cache entry size = cluster size".

===========================

I think I haven't forgotten anything. As I said, I have a working prototype of this, and if you like the idea I'd like to publish it soon.

Any questions or comments will be appreciated.

Thanks!

Berto