On Thu 27 Jun 2019 06:54:34 PM CEST, Kevin Wolf wrote:
>> |-----------------+----------------+-----------------|
>> | Cluster size    | subclusters=on | subclusters=off |
>> |-----------------+----------------+-----------------|
>> | 2 MB (256 KB)   |     571 IOPS   |     124 IOPS    |
>> | 1 MB (128 KB)   |     863 IOPS   |     212 IOPS    |
>> | 512 KB (64 KB)  |    1678 IOPS   |     365 IOPS    |
>> | 256 KB (32 KB)  |    2618 IOPS   |     568 IOPS    |
>> | 128 KB (16 KB)  |    4907 IOPS   |     873 IOPS    |
>> | 64 KB (8 KB)    |   10613 IOPS   |    1680 IOPS    |
>> | 32 KB (4 KB)    |   13038 IOPS   |    2476 IOPS    |
>> | 4 KB (512 B)    |     101 IOPS   |     101 IOPS    |
>> |-----------------+----------------+-----------------|
>
> So at first sight, if you compare the numbers in the same row,
> subclusters=on is a clear winner.
Yes, as expected.

> But almost more interesting is the observation that at least for large
> cluster sizes, subcluster size X performs almost identically to
> cluster size X without subclusters:

But that's also to be expected, isn't it? The only difference (in terms
of I/O) between allocating a 64 KB cluster and a 64 KB subcluster is how
the L2 entry is updated. The amount of data that is read and written is
the same.

> Something interesting happens in the part that you didn't benchmark
> between 4 KB and 32 KB (actually, maybe it has already started for the
> 32 KB case): Performance collapses for small cluster sizes, but it
> reaches record highs for small subclusters.

I didn't measure that initially because I thought that having
subclusters < 4 KB was not very useful; the 512 B case was just there to
see how it would perform in the extreme case. I decided to get the rest
of the numbers anyway, so here's the complete table with the missing
rows (sizes in KB unless noted otherwise, results in IOPS):

|---------+------------+----------------+-----------------|
| Cluster | Subcluster | subclusters=on | subclusters=off |
|---------+------------+----------------+-----------------|
|    2048 |        256 |            571 |             124 |
|    1024 |        128 |            863 |             212 |
|     512 |         64 |           1678 |             365 |
|     256 |         32 |           2618 |             568 |
|     128 |         16 |           4907 |             873 |
|      64 |          8 |          10613 |            1680 |
|      32 |          4 |          13038 |            2476 |
|      16 |          2 |           7555 |            3389 |
|       8 |          1 |            299 |             420 |
|       4 |      512 B |            101 |             101 |
|---------+------------+----------------+-----------------|

> I suspect that this is because L2 tables are becoming very small with
> 4 KB clusters, but they are still 32 KB if 4 KB is only the subcluster
> size.

Yes, I explained that in my original proposal from 2017. I didn't
actually investigate further, but my take is that 4 KB clusters require
constant allocations and refcount updates, and that the L2 tables fill
up very quickly.

> (By the way, did the L2 cache cover the whole disk in your
> benchmarks?)

Yes, in all cases (I forgot to mention that, sorry).
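To put rough numbers on that last point: in qcow2 an L2 table occupies
exactly one cluster and each entry is 8 bytes, so the amount of guest
data a single table maps shrinks quadratically with the cluster size. A
small back-of-the-envelope sketch (assuming the standard 64-bit L2
entries; the helper name is just for illustration):

```python
# Rough sketch of how much guest data one qcow2 L2 table maps.
# Assumes standard qcow2: an L2 table occupies one cluster and
# each entry is 8 bytes (64 bits).

def l2_table_coverage(cluster_size):
    """Bytes of guest data mapped by a single L2 table."""
    entries = cluster_size // 8        # 8-byte entries per table
    return entries * cluster_size      # each entry maps one cluster

KB = 1024
MB = 1024 * KB

# With 4 KB clusters a table maps only 2 MB of guest data...
assert l2_table_coverage(4 * KB) == 2 * MB
# ...while with 32 KB clusters (4 KB being only the subcluster size)
# a table maps 128 MB, so tables fill up far more slowly.
assert l2_table_coverage(32 * KB) == 128 * MB
```

This is consistent with the suspicion above: at a 4 KB cluster size a
new L2 table has to be allocated for every 2 MB written.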
> I think this gives us two completely different motivations why
> subclusters could be useful, depending on the cluster size you're
> using:
>
> 1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
>    can get IOPS rates during cluster allocation that you couldn't come
>    even close to before. I think this is a quite strong argument in
>    favour of the feature.

Yes, indeed. You would need to select the subcluster size so that it
matches the size of guest I/O requests (the filesystem block size is
probably the best choice).

> 2. With larger cluster sizes, you don't get a significant difference
>    in the performance during cluster allocation compared to just using
>    the subcluster size as the cluster size without having subclusters.
>    Here, the motivation could be something along the lines of avoiding
>    fragmentation. This would probably need more benchmarks to check
>    how fragmentation affects the performance after the initial write.
>
> This one could possibly be a valid justification, too, but I think it
> would need more work on demonstrating that the effects are real and
> justify the implementation and long-term maintenance effort required
> for subclusters.

I agree. However, another benefit of large cluster sizes is that you
reduce the amount of metadata, so you get the same performance with a
smaller L2 cache.

>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.
>
> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)

The 32K/4K case with no COW is obviously much faster, though.

>> === Changes to the on-disk format ===
>>
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>>
>> (1) Storing the bitmap inside the 64-bit entry
>> (2) Making L2 entries 128 bits wide
>> (3) Storing the bitmap somewhere else
>>
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k
> or 2M/64k. Or would you intend to use some of these 32 bits for
> something different?
>
> I think (3) is the worst because it adds another kind of metadata
> table that we have to consider for ordering updates. So it might come
> with more frequent cache flushes.
>
>> ===========================
>>
>> And I think that's all. As you can see, I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one); the first goal should be to decide whether this is still
>> an interesting feature or not.
>>
>> So, any questions or comments will be much appreciated.
>
> It does look very interesting to me, at least for small subcluster
> sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves
> the fragmentation problems that they have talked about a lot. It might
> end up being interesting for these cases, too.
>
> Kevin
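Option (2) above can be sketched quickly: widening L2 entries to 128
bits would leave room for a 32-bit bitmap, i.e. one allocation bit per
subcluster (hence 32 subclusters per cluster, giving combinations like
2M/64k). A minimal Python sketch of how such a bitmap could gate
copy-on-write; the bit layout and helper names here are hypothetical,
purely for illustration, not the final on-disk format:

```python
# Sketch of option (2): a 32-bit per-cluster bitmap with one
# "allocated" bit per subcluster. Layout and names are hypothetical.

SUBCLUSTERS_PER_CLUSTER = 32

def subcluster_is_allocated(bitmap, index):
    """True if subcluster `index` is marked allocated in the bitmap."""
    return bool(bitmap & (1 << index))

def allocate_subclusters(bitmap, first, count):
    """Return a new bitmap with subclusters [first, first+count)
    marked allocated."""
    return bitmap | (((1 << count) - 1) << first)

def needs_cow(bitmap, first, count):
    """A write to subclusters [first, first+count) needs copy-on-write
    only if some of them are still unallocated."""
    mask = ((1 << count) - 1) << first
    return (bitmap & mask) != mask

bitmap = 0                                   # freshly allocated cluster
bitmap = allocate_subclusters(bitmap, 3, 1)  # guest writes subcluster 3
assert subcluster_is_allocated(bitmap, 3)
assert not subcluster_is_allocated(bitmap, 4)
assert needs_cow(bitmap, 2, 4)        # 2, 4 and 5 are still unallocated
assert not needs_cow(bitmap, 3, 1)    # 3 was already written: no COW
```

The point of the sketch is the last two assertions: a write that only
touches already-allocated subclusters can skip the COW read entirely,
which is where the IOPS gains in the tables above come from.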