On Thu 27 Jun 2019 07:08:29 PM CEST, Denis Lunev wrote:
> But can we get a link to the repo with actual version of patches.

Hi,

I updated my code to increase the L2 entry size from 64 bits to 128
bits, and thanks to this we now have 32 subclusters per cluster (32
bits for "subcluster allocated" and 32 for "subcluster is all zeroes").

I also fixed a few bugs along the way and started to clean up the code
so that it is more readable.

You can get it here:

   https://github.com/bertogg/qemu/releases/tag/subcluster-allocation-prototype-20190711

The idea is that you can test it, evaluate the performance and see
whether the general approach makes sense, but this is obviously not
release-quality code, so don't focus too much on the coding style,
variable names, hacks, etc. Many things need to change, other things
still need to be implemented, and I'm already in the process of doing
it.

Some questions that are still open:

- The number of subclusters per cluster is very easy to configure. It
  is now hardcoded to 32 in qcow2_do_open(), but any power of 2 would
  work (just change the number there if you want to test it). Would an
  option for this be worth adding?

- We could also allow the user to choose 64 subclusters per cluster
  and disable the "all zeroes" bits in that case. It is quite simple
  in terms of lines of code, but it would make the qcow2 spec a bit
  more complicated.

- We would now have "all zeroes" bits at both the cluster and the
  subcluster level, so there's an ambiguity here that we need to
  resolve. In particular, what happens if we have a
  QCOW2_CLUSTER_ZERO_ALLOC cluster but some bits from the bitmap are
  set? Do we ignore them completely?

I also ran some I/O tests using a scenario similar to the one from last
time (SSD drive, 40GB backing image).
Here are the results; you can see the difference between the previous
prototype (8 subclusters per cluster) and the new one (32):

|--------------+----------------+---------------+-----------------|
| Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
|--------------+----------------+---------------+-----------------|
|         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
|         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
|        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
|        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
|        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
|       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
|       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
|       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
|      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
|      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
|--------------+----------------+---------------+-----------------|

I'm surprised about the 512 KB cluster / 32 subclusters case (I would
expect ~3300 IOPS), but I ran it a few times and the results are always
the same. I still haven't investigated why that happens. The rest of
the results seem more or less normal.

I will now continue working towards having a complete solution, but
any feedback or comments will be very welcome.

Regards,

Berto
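
For readers who want to picture the 64-bit subcluster bitmap described
above, here is a minimal sketch. It is not taken from the prototype: the
bit order (low 32 bits for "allocated", high 32 bits for "all zeroes")
and all names (SC_PER_CLUSTER, sc_is_allocated(), sc_is_zero(),
sc_size(), sc_index()) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: 32 subclusters per cluster, as in the text. */
#define SC_PER_CLUSTER 32

/* Assumed layout: bit N (0..31) means "subcluster N is allocated". */
static inline bool sc_is_allocated(uint64_t bitmap, unsigned sc)
{
    assert(sc < SC_PER_CLUSTER);
    return (bitmap >> sc) & 1;
}

/* Assumed layout: bit 32 + N means "subcluster N is all zeroes". */
static inline bool sc_is_zero(uint64_t bitmap, unsigned sc)
{
    assert(sc < SC_PER_CLUSTER);
    return (bitmap >> (SC_PER_CLUSTER + sc)) & 1;
}

/* Size of one subcluster in bytes for a given cluster size. */
static inline uint64_t sc_size(uint64_t cluster_size)
{
    return cluster_size / SC_PER_CLUSTER;
}

/* Index of the subcluster that contains a byte offset within a cluster. */
static inline unsigned sc_index(uint64_t offset_in_cluster,
                                uint64_t cluster_size)
{
    assert(offset_in_cluster < cluster_size);
    return (unsigned)(offset_in_cluster / sc_size(cluster_size));
}
```

With 64 KB clusters this gives 2 KB subclusters, so a write at offset
4096 inside a cluster would fall in subcluster 2.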