On Thu 27 Jun 2019 06:54:34 PM CEST, Kevin Wolf wrote:
>> |-----------------+----------------+-----------------|
>> | Cluster size    | subclusters=on | subclusters=off |
>> |-----------------+----------------+-----------------|
>> | 2 MB (256 KB)   |     571 IOPS   |     124 IOPS    |
>> | 1 MB (128 KB)   |     863 IOPS   |     212 IOPS    |
>> | 512 KB (64 KB)  |    1678 IOPS   |     365 IOPS    |
>> | 256 KB (32 KB)  |    2618 IOPS   |     568 IOPS    |
>> | 128 KB (16 KB)  |    4907 IOPS   |     873 IOPS    |
>> | 64 KB (8 KB)    |   10613 IOPS   |    1680 IOPS    |
>> | 32 KB (4 KB)    |   13038 IOPS   |    2476 IOPS    |
>> | 4 KB (512 B)    |     101 IOPS   |     101 IOPS    |
>> |-----------------+----------------+-----------------|
>
> So at first sight, if you compare the numbers in the same row,
> subclusters=on is a clear winner.
Yes, as expected.

> But almost more interesting is the observation that at least for large
> cluster sizes, subcluster size X performs almost identically to
> cluster size X without subclusters:

But that's also to be expected, isn't it? The only difference (in terms
of I/O) between allocating a 64 KB cluster and a 64 KB subcluster is how
the L2 entry is updated. The amount of data that is read and written is
the same.

> Something interesting happens in the part that you didn't benchmark
> between 4 KB and 32 KB (actually, maybe it has already started for the
> 32 KB case): Performance collapses for small cluster sizes, but it
> reaches record highs for small subclusters.

I didn't measure that initially because I thought that having
subclusters < 4 KB was not very useful; the 512 B case was just there to
see how it would perform in the extreme case. I decided to get the rest
of the numbers anyway, so here's the complete table with the missing
rows (sizes in KB unless noted otherwise, results in IOPS):

|---------+------------+----------------+-----------------|
| Cluster | Subcluster | subclusters=on | subclusters=off |
|---------+------------+----------------+-----------------|
|    2048 |        256 |            571 |             124 |
|    1024 |        128 |            863 |             212 |
|     512 |         64 |           1678 |             365 |
|     256 |         32 |           2618 |             568 |
|     128 |         16 |           4907 |             873 |
|      64 |          8 |          10613 |            1680 |
|      32 |          4 |          13038 |            2476 |
|      16 |          2 |           7555 |            3389 |
|       8 |          1 |            299 |             420 |
|       4 |      512 B |            101 |             101 |
|---------+------------+----------------+-----------------|

> I suspect that this is because L2 tables are becoming very small with
> 4 KB clusters, but they are still 32 KB if 4 KB is only the subcluster
> size.

Yes, I explained that in my original proposal from 2017. I didn't
actually investigate further, but my take is that 4 KB clusters require
constant allocations and refcount updates, and that the L2 tables fill
up very quickly.

> (By the way, did the L2 cache cover the whole disk in your
> benchmarks?)

Yes, in all cases (I forgot to mention that, sorry).
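To put rough numbers on that last point: in qcow2 an L2 table occupies
exactly one cluster and each entry is 8 bytes, so the amount of guest
data a single table maps shrinks quadratically with the cluster size. A
small back-of-the-envelope sketch (assuming the standard 64-bit L2
entries; the helper name is just for illustration):

```python
# Rough sketch of how much guest data one qcow2 L2 table maps.
# Assumes standard qcow2: an L2 table occupies one cluster and
# each entry is 8 bytes (64 bits).

def l2_table_coverage(cluster_size):
    """Bytes of guest data mapped by a single L2 table."""
    entries = cluster_size // 8        # 8-byte entries per table
    return entries * cluster_size      # each entry maps one cluster

KB = 1024
MB = 1024 * KB

# With 4 KB clusters a table maps only 2 MB of guest data...
assert l2_table_coverage(4 * KB) == 2 * MB
# ...while with 32 KB clusters (4 KB being only the subcluster size)
# a table maps 128 MB, so tables fill up far more slowly.
assert l2_table_coverage(32 * KB) == 128 * MB
```

This is consistent with the suspicion above: at a 4 KB cluster size a
new L2 table has to be allocated for every 2 MB written.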
> I think this gives us two completely different motivations why
> subclusters could be useful, depending on the cluster size you're
> using:
>
> 1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
>    can get IOPS rates during cluster allocation that you couldn't come
>    even close to before. I think this is a quite strong argument in
>    favour of the feature.

Yes, indeed. You would need to select the subcluster size so that it
matches the size of guest I/O requests (the filesystem block size is
probably the best choice).

> 2. With larger cluster sizes, you don't get a significant difference
>    in the performance during cluster allocation compared to just using
>    the subcluster size as the cluster size without having subclusters.
>    Here, the motivation could be something along the lines of avoiding
>    fragmentation. This would probably need more benchmarks to check
>    how fragmentation affects the performance after the initial write.
>
> This one could possibly be a valid justification, too, but I think it
> would need more work on demonstrating that the effects are real and
> justify the implementation and long-term maintenance effort required
> for subclusters.

I agree. However, another benefit of large cluster sizes is that you
reduce the amount of metadata, so you get the same performance with a
smaller L2 cache.

>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.
>
> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)

The 32K/4K case with no COW is obviously much faster, though.

>> === Changes to the on-disk format ===
>>
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>>
>> (1) Storing the bitmap inside the 64-bit entry
>> (2) Making L2 entries 128 bits wide
>> (3) Storing the bitmap somewhere else
>>
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k
> or 2M/64k. Or would you intend to use some of these 32 bits for
> something different?
>
> I think (3) is the worst because it adds another kind of metadata
> table that we have to consider for ordering updates. So it might come
> with more frequent cache flushes.
>
>> ===========================
>>
>> And I think that's all. As you can see, I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one); the first goal should be to decide whether this is still
>> an interesting feature or not.
>>
>> So, any questions or comments will be much appreciated.
>
> It does look very interesting to me, at least for small subcluster
> sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves
> the fragmentation problems that they have talked about a lot. It might
> end up being interesting for these cases, too.
>
> Kevin
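Option (2) above can be sketched quickly: widening L2 entries to 128
bits would leave room for a 32-bit bitmap, i.e. one allocation bit per
subcluster (hence 32 subclusters per cluster, giving combinations like
2M/64k). A minimal Python sketch of how such a bitmap could gate
copy-on-write; the bit layout and helper names here are hypothetical,
purely for illustration, not the final on-disk format:

```python
# Sketch of option (2): a 32-bit per-cluster bitmap with one
# "allocated" bit per subcluster. Layout and names are hypothetical.

SUBCLUSTERS_PER_CLUSTER = 32

def subcluster_is_allocated(bitmap, index):
    """True if subcluster `index` is marked allocated in the bitmap."""
    return bool(bitmap & (1 << index))

def allocate_subclusters(bitmap, first, count):
    """Return a new bitmap with subclusters [first, first+count)
    marked allocated."""
    return bitmap | (((1 << count) - 1) << first)

def needs_cow(bitmap, first, count):
    """A write to subclusters [first, first+count) needs copy-on-write
    only if some of them are still unallocated."""
    mask = ((1 << count) - 1) << first
    return (bitmap & mask) != mask

bitmap = 0                                   # freshly allocated cluster
bitmap = allocate_subclusters(bitmap, 3, 1)  # guest writes subcluster 3
assert subcluster_is_allocated(bitmap, 3)
assert not subcluster_is_allocated(bitmap, 4)
assert needs_cow(bitmap, 2, 4)        # 2, 4 and 5 are still unallocated
assert not needs_cow(bitmap, 3, 1)    # 3 was already written: no COW
```

The point of the sketch is the last two assertions: a write that only
touches already-allocated subclusters can skip the COW read entirely,
which is where the IOPS gains in the tables above come from.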