Hello, Chuang, On Fri, Feb 17, 2023 at 04:11:19PM +0800, Chuang Xu wrote: > Error 1 was triggered by our sanity check. I try to add RCU_READ_LOCK_GUARD() > in address_space_init() and it works. But I'm not sure if this code change is > appropriate. If this change is not appropriate, we may need to consider other > sanity check.
I'd suggest not adding RCU locks without a good reason. address_space_init() is definitely a special context because the AS is exclusively owned by the caller before it returns. It means no RCU protection needed at all because no one else is touching it; neither do we need qatomic_rcu_read() when read. So I suggest we directly reference current_map, even though that'll need a rich comment: static void address_space_set_flatview(AddressSpace *as) { - FlatView *old_view = address_space_to_flatview(as); + /* + * NOTE: we don't use RCU flavoured of address_space_to_flatview() + * because we exclusively own as->current_map here: it's either during + * init of an address space, or during commit() with BQL held. + */ + FlatView *old_view = as->current_map; We can have address_space_to_flatview_raw() but since we'll directly modify as->current_map very soon in the same function, so may not even bother. > > Error 2 was related to postcopy. I read the official document of postcopy > (I hope it is the latest) and learned that two threads will call > qemu_loadvm_state_main() in the process of postcopy. The one called by main > thread > will take the BQL, and the one called by ram_listen thread won't take the BQL. > The latter checks whether the BQL is held when calling > memory_region_transaction_commit(), > thus triggering the assertion. Creating a new function > qemu_loadvm_state_ram_listen() > without memory_region_transaction_commit() will solve this error. Sounds right, because the whole qemu_loadvm_state_main() process shouldn't load any device state or anything that requires BQL at all; in most cases that should be only RAM states leftovers. I think we only want to optimize precopy but not the postcopy phase. Note! it should include the phase when transferring precopy -> postcopy too, so it's covering postcopy, just not covering the "post" phase of migration - if you see that's a nested call to qemu_loadvm_state_main() with a whole MIG_CMD_PACKAGED package which is actually got covered, which is the real meat for postcopy on device transitions. So in short: instead of creating qemu_loadvm_state_ram_listen(), how about modifying your patch 3, instead of changing inside qemu_loadvm_state_main() we can do that for qemu_loadvm_state() only (so you can wrap the begin() and commit() over qemu_loadvm_state_main() there)? > > I don't know if you suggest using this patch in postcopy. If this patch is > applicable to > postcopy, considering the difference between how postcopy and precheck load > device state, > do we need to consider more details? See above. Yes I definitely hope postcopy will be covered too. Thanks! -- Peter Xu