Haren Myneni <ha...@linux.ibm.com> writes:
> On Thu, 2022-09-22 at 07:14 -0500, Nathan Lynch wrote:
>> Haren Myneni <ha...@linux.ibm.com> writes:
>> > When the migration is initiated, the hypervisor changes VAS
>> > mappings as part of pre-migration event. Then the OS gets the
>> > migration event which closes all VAS windows before the migration
>> > starts. NX generates continuous faults until windows are closed
>> > and the user space can not differentiate these NX faults coming
>> > from the actual migration. So to reduce this time window, close
>> > VAS windows first in pseries_migrate_partition().
>>
>> I'm concerned that this is only narrowing a window of time where
>> undesirable faults occur, and that it may not be sufficient for all
>> configurations. Migrations can be in progress for minutes or hours,
>> while the time that we wait for the VASI state transition is usually
>> seconds or minutes. So I worry that this works around a problem in
>> limited cases but doesn't cover them all.
>>
>> Maybe I don't understand the problem well enough. How does user space
>> respond to the NX faults?
>
> The user space resend the request to NX whenever the request is
> returned with NX fault. So the process should be same even for faults
> caused by the pre-migration.
>
> Whereas the paste will be returned with failure when the window is
> closed (unmap the paste address) and it can be considered as NX busy.
> Up to the user space whether to send the request again after some delay
> or fall back to SW compression and send the request again later.
>
> For the migration, pre-migration event is notified to the hypervisor
> and then OS will receive the migration event (SUSPEND) - So this patch
> close windows early before VASI so that removing NX fault handling
> during the time taken for VASI state transistion.

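As a rough illustration of the user-space behavior described in the
quoted text, the retry policy might look something like the sketch
below. Every name here (nx_outcome, nx_action, nx_next_action,
max_busy_retries) is hypothetical and not taken from any real NX
library or from the patch. It only restates the three cases: resend on
an NX fault, treat a failed paste as NX busy, and eventually fall back
to software compression.

```c
#include <assert.h>

/* Hypothetical outcomes of one attempt to submit a request to NX. */
enum nx_outcome {
	NX_DONE,		/* request completed */
	NX_FAULT,		/* request returned with an NX fault */
	NX_PASTE_FAILED,	/* paste failed (window closed): treat as NX busy */
};

/* Hypothetical next steps available to the user-space caller. */
enum nx_action {
	ACT_FINISH,		/* nothing more to do */
	ACT_RESEND,		/* resend the request immediately */
	ACT_RESEND_AFTER_DELAY,	/* back off, then resend */
	ACT_SW_FALLBACK,	/* use software compression instead */
};

/*
 * Decide what to do after one attempt. busy_retries counts how many
 * times the caller has already backed off; max_busy_retries is an
 * assumed policy knob, since the choice is left up to user space.
 */
static enum nx_action nx_next_action(enum nx_outcome out,
				     int busy_retries,
				     int max_busy_retries)
{
	assert(busy_retries >= 0 && max_busy_retries >= 0);

	switch (out) {
	case NX_DONE:
		return ACT_FINISH;
	case NX_FAULT:
		/* Same handling whether or not a migration caused it. */
		return ACT_RESEND;
	case NX_PASTE_FAILED:
		if (busy_retries < max_busy_retries)
			return ACT_RESEND_AFTER_DELAY;
		return ACT_SW_FALLBACK;
	}
	return ACT_SW_FALLBACK;
}
```

With windows closed early, a client in this sketch would take the
NX_PASTE_FAILED path during the VASI wait rather than looping on
NX_FAULT.
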
OK, so we can consider this a quality-of-implementation improvement
that allows better behavior and fewer wasted retries for NX clients in
a migration scenario, but there's not really a correctness issue.

With that clarified, I've confirmed that the slightly altered control
flow and error handling in pseries_migrate_partition() look correct
after your change.

Reviewed-by: Nathan Lynch <nath...@linux.ibm.com>