On 17/12/2025 21:21, Peter Xu wrote:
> On Wed, Dec 17, 2025 at 02:46:58PM +0800, Chuang Xu wrote:
>> On 17/12/2025 00:26, Peter Xu wrote:
>>> On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
>>>> "Chuang Xu" <[email protected]> writes:
>>>>
>>>>> From: xuchuangxclwt <[email protected]>
>>>>>
>>>>> In our long-term experience at Bytedance, we've found that under
>>>>> the same load, live migration of larger VMs with more devices is
>>>>> often more difficult to converge (requiring a larger downtime limit).
>>>>>
>>>>> Through some testing and calculations, we conclude that bitmap sync time
>>>>> affects the calculation of live migration bandwidth.
>>> Side note:
>>>
>>> I forgot to mention when replying to the old versions, but we introduced
>>> avail-switchover-bandwidth to partially remedy this problem when we hit it
>>> before - which may or may not be exactly the same reason here on unaligned
>>> syncs, as we didn't investigate further (we had VFIO-PCI devices when
>>> testing), but the overall logic should be similar in that the bw was
>>> calculated too small.
>> At Bytedance, we also migrate VMs with VFIO devices, which likewise suffer
>> from long VFIO bitmap sync times on large VMs.
>>> So even with this patch optimizing sync, the bw is still not accurate.
>>> I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
>>> short a period to take samples, or at least we should be able to remember
>>> more samples so the reported bw (even if we keep sampling per 100ms) will
>>> cover a longer period.
>>>
>>> Feel free to share your thoughts if you have any.
>>>
>> FYI:
>> Initially, when I encountered the problem of large VM migrations being
>> hard to converge, I tried subtracting the bitmap sync time from the
>> bandwidth calculation, which alleviated the problem somewhat. However,
>> through formula calculation, I found that this did not completely solve
>> the problem. Therefore, I
> If you ruled out sync time, why is the bw still not accurate? Have you
> investigated that?
>
> Maybe there's something else happening besides the sync period you
> excluded.
Referring to the formula I wrote in the cover letter, after subtracting the
sync time we get the prerequisite that R = B. Substituting this condition
into the subsequent derivation (B * t = D * (x + t) and R * y > D * (x + t)),
we eventually get y > D * x / (B - D). This means that even if our bandwidth
calculation is correct, the sync time can still affect our judgment of the
downtime condition. (A step-by-step expansion is appended at the end of this
mail.)
>> decided to conduct a targeted analysis of this specific scenario to
>> minimize the bitmap sync time as much as possible.
>>>>> When the addresses processed are not aligned, a large number of
>>>>> clear_dirty ioctls occur (e.g. a 4MB misaligned memory range can
>>>>> generate 2048 clear_dirty ioctls from two different memory_listeners),
>>>>> which increases the time required for bitmap_sync and makes it
>>>>> more difficult for dirty pages to converge.
>>>>>
>>>>> For a 64C256G vm with 8 vhost-user-net (32 queues per nic) and
>>>>> 16 vhost-user-blk (4 queues per blk), the sync time is as high as *73ms*
>>>>> (tested with a 10GBps dirty rate; the sync time increases as the dirty
>>>>> page rate increases). Here is each part of the sync time:
>>>>>
>>>>> - sync from kvm to ram_list: 2.5ms
>>>>> - vhost_log_sync: 3ms
>>>>> - sync aligned memory from ram_list to RAMBlock: 5ms
>>>>> - sync misaligned memory from ram_list to RAMBlock: 61ms
>>>>>
>>>>> After merging those fragmented clear_dirty ioctls, syncing
>>>>> misaligned memory from ram_list to RAMBlock takes only about 1ms,
>>>>> and the total sync time is only *12ms*.
>>>>>
>>>>> Signed-off-by: Chuang Xu <[email protected]>
>>>>> ---
>>>>>  accel/tcg/cputlb.c       |  5 ++--
>>>>>  include/system/physmem.h |  7 +++---
>>>>>  migration/ram.c          | 17 ++++----------
>>>>>  system/memory.c          |  2 +-
>>>>>  system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
>>>>>  5 files changed, 47 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
>>>>> index fd1606c856..c8827c8b0d 100644
>>>>> --- a/accel/tcg/cputlb.c
>>>>> +++ b/accel/tcg/cputlb.c
>>>>> @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
>>>>>  void tlb_protect_code(ram_addr_t ram_addr)
>>>>>  {
>>>>>      physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
>>>>> -                                         TARGET_PAGE_SIZE,
>>>>> -                                         DIRTY_MEMORY_CODE);
>>>>> +                                         TARGET_PAGE_SIZE,
>>>>> +                                         DIRTY_MEMORY_CODE,
>>>>> +                                         NULL);
>>>>>  }
>>>>>
>>>>>  /* update the TLB so that writes in physical page 'phys_addr' are no longer
>>>>> diff --git a/include/system/physmem.h b/include/system/physmem.h
>>>>> index 879f6eae38..8eeace9d1f 100644
>>>>> --- a/include/system/physmem.h
>>>>> +++ b/include/system/physmem.h
>>>>> @@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
>>>>>
>>>>>  void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
>>>>>
>>>>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>> -                                          ram_addr_t length,
>>>>> -                                          unsigned client);
>>>>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>> +                                              ram_addr_t length,
>>>>> +                                              unsigned client,
>>>>> +                                              unsigned long *dest);
>>>>>
>>>>>  DirtyBitmapSnapshot *
>>>>>  physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>> index 29f016cb25..a03c9874a2 100644
>>>>> --- a/migration/ram.c
>>>>> +++ b/migration/ram.c
>>>>> @@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>>>>                                                    ram_addr_t start,
>>>>>                                                    ram_addr_t length)
>>>>>  {
>>>>> -    ram_addr_t addr;
>>>>>      unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>>>>>      uint64_t num_dirty = 0;
>>>>>      unsigned long *dest = rb->bmap;
>>>>> @@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>>>>      } else {
>>>>>          ram_addr_t offset = rb->offset;
>>>>>
>>>>> -        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
>>>>> -            if (physical_memory_test_and_clear_dirty(
>>>>> -                        start + addr + offset,
>>>>> -                        TARGET_PAGE_SIZE,
>>>>> -                        DIRTY_MEMORY_MIGRATION)) {
>>>>> -                long k = (start + addr) >> TARGET_PAGE_BITS;
>>>>> -                if (!test_and_set_bit(k, dest)) {
>>>>> -                    num_dirty++;
>>>>> -                }
>>>>> -            }
>>>>> -        }
>>>>> +        num_dirty = physical_memory_test_and_clear_dirty(
>>>>> +                            start + offset,
>>>>> +                            length,
>>>>> +                            DIRTY_MEMORY_MIGRATION,
>>>>> +                            dest);
>>>>>      }
>>>>>
>>>>>      return num_dirty;
>>>>> diff --git a/system/memory.c b/system/memory.c
>>>>> index 8b84661ae3..666364392d 100644
>>>>> --- a/system/memory.c
>>>>> +++ b/system/memory.c
>>>>> @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
>>>>>  {
>>>>>      assert(mr->ram_block);
>>>>>      physical_memory_test_and_clear_dirty(
>>>>> -        memory_region_get_ram_addr(mr) + addr, size, client);
>>>>> +        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
>>>>>  }
>>>>>
>>>>>  int memory_region_get_fd(MemoryRegion *mr)
>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>> index c9869e4049..f8b660dafe 100644
>>>>> --- a/system/physmem.c
>>>>> +++ b/system/physmem.c
>>>>> @@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
>>>>>      }
>>>>>  }
>>>>>
>>>>> -/* Note: start and end must be within the same ram block. */
>>>>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>> +/*
>>>>> + * Note: start and end must be within the same ram block.
>>>>> + *
>>>>> + * @dest usage:
>>>> I'm not sure if it's just me, but I find this "dest" term quite
>>>> confusing. "bmap" might be more straight-forward.
>>>>
>>>>> + * - When @dest is provided, set bits for newly discovered dirty pages
>>>>> + *   only if the bit wasn't already set in dest, and count those pages
>>>>> + *   in num_dirty.
>>>> Am I misreading the code? I don't see this "set ... only if the bit
>>>> wasn't already set" part. What I see is: "set bits, but only count those
>>>> pages if the bit wasn't already set".
>>> Agreed on both points.. one more thing to mention below.
>> This will be fixed in the next version.
>>>>> + * - When @dest is NULL, count all dirty pages in the range
>>>>> + *
>>>>> + * @return:
>>>>> + * - Number of dirty guest pages found within [start, start + length).
>>>>> + */
>>>>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>>                                                ram_addr_t length,
>>>>> -                                              unsigned client)
>>>>> +                                              unsigned client,
>>>>> +                                              unsigned long *dest)
>>>>>  {
>>>>>      DirtyMemoryBlocks *blocks;
>>>>>      unsigned long end, page, start_page;
>>>>> -    bool dirty = false;
>>>>> +    uint64_t num_dirty = 0;
>>>>>      RAMBlock *ramblock;
>>>>>      uint64_t mr_offset, mr_size;
>>>>>
>>>>>      if (length == 0) {
>>>>> -        return false;
>>>>> +        return 0;
>>>>>      }
>>>>>
>>>>>      end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
>>>>> @@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>>      while (page < end) {
>>>>>          unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
>>>>>          unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
>>>>> -        unsigned long num = MIN(end - page,
>>>>> -                                DIRTY_MEMORY_BLOCK_SIZE - offset);
>>>>>
>>>>> -        dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
>>>>> -                                              offset, num);
>>>>> -        page += num;
>>>>> +        if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
>>>>> +            if (dest) {
>>>>> +                unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
>>>>> +                if (!test_and_set_bit(k, dest)) {
>>>>> +                    num_dirty++;
>>>>> +                }
>>>>> +            } else {
>>>>> +                num_dirty++;
>>>>> +            }
>>>>> +        }
>>>>> +
>>>>> +        page++;
>>> Sorry, I could have mentioned this in the previous version: IMHO it'd
>>> still be nice to keep the single atomic operation over "num" for the
>>> !dest/!bmap case. There's no reason to introduce even the slightest
>>> regression in those paths.
>>>
>>> Thanks,
>> bitmap_test_and_clear_atomic returns bool, not the number of bits cleared.
>> So for the !bmap case we could only return whether anything was dirty, not
>> the number of dirty pages, and that might be a bit confusing.
> Ah, right..
>
> Looks like we only have two real users of this API that clear more than
> one target page (tcx_reset, qemu_ram_resize); I assume they're not perf
> critical as of now. When it comes, it should be easy to optimize.
>
> Unless others have concerns, IMHO we can go with the current one until
> later. Feel free to ignore this comment.
>
> Thanks,
>
>>>>>      }
>>>>>
>>>>>      mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
>>>>> @@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>>          memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
>>>>>      }
>>>>>
>>>>> -    if (dirty) {
>>>>> +    if (num_dirty) {
>>>>>          physical_memory_dirty_bits_cleared(start, length);
>>>>>      }
>>>>>
>>>>> -    return dirty;
>>>>> +    return num_dirty;
>>>>>  }
>>>>>
>>>>>  static void physical_memory_clear_dirty_range(ram_addr_t addr,
>>>>>                                                ram_addr_t length)
>>>>>  {
>>>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
>>>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
>>>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
>>>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
>>>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
>>>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
>>>>>  }
>>>>>
>>>>>  DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty
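
For reference, here is the derivation above spelled out step by step. The
variable meanings are my reading of the cover letter and should be checked
against it: B = calculated bandwidth, R = real bandwidth, D = dirty page
rate, t = duration of a migration iteration, x = bitmap sync time,
y = downtime limit.

    Convergence of an iteration:  B * t = D * (x + t)
                              =>  t * (B - D) = D * x
                              =>  t = D * x / (B - D)

    Downtime condition:           R * y > D * (x + t)

    With R = B (sync time already excluded from the bw calculation), the
    first equation gives D * (x + t) = B * t, so:

                                  B * y > B * t
                              =>  y > t = D * x / (B - D)

So even with a perfectly accurate bandwidth estimate, the downtime limit y
required for convergence still grows with the sync time x, which is why
shrinking the sync time itself matters.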
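
On the 2048-ioctl figure in the commit message: assuming a 4KB target page
size, a 4MB misaligned range spans 1024 pages, and if each page is cleared
with its own clear_dirty ioctl by each of the two memory listeners, that is
1024 * 2 = 2048 ioctls. This breakdown is my inference, not stated in the
patch.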
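
On the @dest semantics questioned above: what the patch implements is "set
the bit in @dest unconditionally, count the page only if the bit was not
already set", i.e. Fabiano's reading rather than the comment's wording
(which Chuang said will be fixed in the next version). Below is a minimal,
self-contained sketch of that accounting - toy, non-atomic helpers standing
in for QEMU's bitmap_test_and_clear_atomic() and test_and_set_bit(), and
without the real code's ramblock-offset adjustment of the page index:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Toy stand-in for bitmap_test_and_clear_atomic(..., offset, 1). */
static bool test_and_clear(unsigned long *map, unsigned long bit)
{
    unsigned long mask = 1UL << (bit % BITS_PER_LONG);
    bool was_set = map[bit / BITS_PER_LONG] & mask;

    map[bit / BITS_PER_LONG] &= ~mask;
    return was_set;
}

/* Toy stand-in for test_and_set_bit(). */
static bool test_and_set(unsigned long *map, unsigned long bit)
{
    unsigned long mask = 1UL << (bit % BITS_PER_LONG);
    bool was_set = map[bit / BITS_PER_LONG] & mask;

    map[bit / BITS_PER_LONG] |= mask;
    return was_set;
}

/*
 * Clear dirty bits in @src over [start, start + npages); when @dest is
 * non-NULL, merge them into @dest and count only the bits that were not
 * already set there, mirroring the patch's num_dirty accounting.
 */
static uint64_t sync_and_count(unsigned long *src, unsigned long *dest,
                               unsigned long start, unsigned long npages)
{
    uint64_t num_dirty = 0;

    for (unsigned long page = start; page < start + npages; page++) {
        if (test_and_clear(src, page)) {
            if (dest) {
                /* The bit is always set; only newly set bits are counted. */
                if (!test_and_set(dest, page)) {
                    num_dirty++;
                }
            } else {
                num_dirty++;
            }
        }
    }
    return num_dirty;
}

int main(void)
{
    unsigned long src[1]  = { 0xbUL }; /* pages 0, 1 and 3 dirty */
    unsigned long dest[1] = { 0x1UL }; /* page 0 already set in dest */

    /* Prints 2: pages 1 and 3 are new; page 0 was already set in dest. */
    printf("newly dirty: %llu\n",
           (unsigned long long)sync_and_count(src, dest, 0, 4));
    return 0;
}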
