On 5/27/2025 3:35 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 20/5/25 20:28, Chenyi Qiang wrote:
>> A new state_change() helper is introduced for RamBlockAttribute
>> to efficiently notify all registered RamDiscardListeners, including
>> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
>> listener can dynamically DMA map/unmap shared pages based on conversion
>> types:
>> - For conversions from shared to private, the VFIO system ensures the
>>    discarding of shared mapping from the IOMMU.
>> - For conversions from private to shared, it triggers the population of
>>    the shared mapping into the IOMMU.
>>
>> Currently, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation. It would be a future work
>> to add more error handling or rollback mechanisms once conversion
>> failures are allowed. For example, in-place conversion of guest_memfd
>> could retry the unmap operation during the conversion from shared to
>> private. However, for now, keep the complex error handling out of the
>> picture as it is not required:
>>
>> - If a conversion request is made for a page already in the desired
>>    state, the helper simply returns success.
>> - For requests involving a range partially in the desired state, there
>>    is no such scenario in practice at present. Simply return error.
>> - If a conversion request is declined by other systems, such as a
>>    failure from VFIO during notify_to_populated(), the failure is
>>    returned directly. As for notify_to_discard(), VFIO cannot fail
>>    unmap/unpin, so no error is returned.
>>
>> Note that the bitmap status is updated before callbacks, allowing
>> listeners to handle memory based on the latest status.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qi...@intel.com>
>> ---
>> Change in v5:
>>      - Move the state_change() back to a helper instead of a callback of
>>        the class since there's no child for the RamBlockAttributeClass.
>>      - Remove the error handling and move them to an individual patch for
>>        simple management.
>>
>> Changes in v4:
>>      - Add the state_change() callback in PrivateSharedManagerClass
>>        instead of the RamBlockAttribute.
>>
>> Changes in v3:
>>      - Move the bitmap update before notifier callbacks.
>>      - Call the notifier callbacks directly in notify_discard/populate()
>>        with the expectation that the request memory range is in the
>>        desired attribute.
>>      - For the case that only partial range in the desire status, handle
>>        the range with block_size granularity for ease of rollback
>>        (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-
>> fb7a125cf...@redhat.com/)
>>
>> Changes in v2:
>>      - Do the alignment changes due to the rename to
>> MemoryAttributeManager
>>      - Move the state_change() helper definition in this patch.
>> ---
>>   include/system/ramblock.h    |   2 +
>>   system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 136 insertions(+)
>>
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index 09255e8495..270dffb2f3 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>>       QLIST_HEAD(, RamDiscardListener) rdl_list;
>>   };
>>   +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>> uint64_t offset,
>> +                                     uint64_t size, bool to_private);
> 
> Not sure about the "to_private" name. I'd think private/shared is
> something KVM operates with and here, in RamBlock, it is discarded/
> populated.

Make sense. To keep consistent, I will rename it as to_discard.

> 
>>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>>   void ram_block_attribute_destroy(RamBlockAttribute *attr);
>>   diff --git a/system/ram-block-attribute.c b/system/ram-block-
>> attribute.c
>> index 8d4a24738c..f12dd4b881 100644
>> --- a/system/ram-block-attribute.c
>> +++ b/system/ram-block-attribute.c
>> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const
>> RamDiscardManager *rdm,
>>                                              
>> ram_block_attribute_rdm_replay_cb);
>>   }
>>   +static bool ram_block_attribute_is_valid_range(RamBlockAttribute
>> *attr,
>> +                                               uint64_t offset,
>> uint64_t size)
>> +{
>> +    MemoryRegion *mr = attr->mr;
>> +
>> +    g_assert(mr);
>> +
>> +    uint64_t region_size = memory_region_size(mr);
>> +    int block_size = ram_block_attribute_get_block_size(attr);
> 
> It is size_t, not int.

Fixed this and all below. Thanks!

> 
>> +
>> +    if (!QEMU_IS_ALIGNED(offset, block_size)) {
> 
> Does not the @size have to be aligned too?

Yes. Actually, the "start" and "size" are already do the alignment check
in kvm_convert_memory(). I doubt if we still need it here. Anyway, in
case of other users in the future, I'll add it.

> 
>> +        return false;
>> +    }
>> +    if (offset + size < offset || !size) {
> 
> This could be just (offset + size <= offset).
> (these overflow checks always blow up my little brain)

Modified.

> 
>> +        return false;
>> +    }
>> +    if (offset >= region_size || offset + size > region_size) {
> 
> Just (offset + size > region_size) should do.

Ditto.

> 
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute
>> *attr,
>> +                                                  uint64_t offset,
>> +                                                  uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            continue;
>> +        }
>> +        rdl->notify_discard(rdl, &tmp);
>> +    }
>> +}
>> +
>> +static int
>> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>> +                                        uint64_t offset, uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +    int ret = 0;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            continue;
>> +        }
>> +        ret = rdl->notify_populate(rdl, &tmp);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute
>> *attr,
>> +                                                   uint64_t offset,
>> +                                                   uint64_t size)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
> 
> size_t.
> 
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    /* We fake a shorter bitmap to avoid searching too far. */
> 
> What is "fake" about it? We truthfully check here that every bit in
> [first_bit, last_bit] is set.

Aha, you ask this question again :)
(https://lore.kernel.org/qemu-devel/7131b4a3-a836-4efd-bcfc-982a0112e...@intel.com/)

If it is really confusing, let me remove this comment in next version.

> 
>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                   first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +static bool
>> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
>> +                                     uint64_t offset, uint64_t size)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
> 
> size_t.
> 
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    /* We fake a shorter bitmap to avoid searching too far. */
>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>> uint64_t offset,
>> +                                     uint64_t size, bool to_private)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
> 
> size_t.
> 
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -1;
> 
> May be -EINVAL?

Modified.

> 
>> +    }
>> +
>> +    /* Already discard/populated */
>> +    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>> +         to_private) ||
>> +        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>> +         !to_private)) {
> 
> A tracepoint would be useful here imho.

[...]

> 
>> +        return 0;
>> +    }
>> +
>> +    /* Unexpected mixture */
>> +    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>> +         to_private) ||
>> +        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>> +         !to_private)) {
>> +        error_report("%s, the range is not all in the desired state: "
>> +                     "(offset 0x%lx, size 0x%lx), %s",
>> +                     __func__, offset, size,
>> +                     to_private ? "private" : "shared");
>> +        return -1;
> 
> -EBUSY?

Maybe also -EINVAL since it is due to the invalid provided mixture
range? But Anyway, according to the discussion in patch #10, I'll add
the support for this mixture scenario. No need to return the error.

> 
>> +    }
>> +
>> +    if (to_private) {
>> +        bitmap_clear(attr->bitmap, first_bit, nbits);
>> +        ram_block_attribute_notify_to_discard(attr, offset, size);
>> +    } else {
>> +        bitmap_set(attr->bitmap, first_bit, nbits);
>> +        ret = ram_block_attribute_notify_to_populated(attr, offset,
>> size);
>> +    }
> 
> and a successful tracepoint here may be?

Good suggestion! I'll add tracepoint in next version.

> 
>> +
>> +    return ret;
>> +}
>> +
>>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>>   {
>>       uint64_t bitmap_size;
> 


Reply via email to