On 6/6/2025 2:37 PM, Steven Sistare wrote:
On 6/6/2025 2:06 PM, JAEHOON KIM wrote:
On 6/6/2025 12:04 PM, Steven Sistare wrote:
On 6/6/2025 12:06 PM, Daniel P. Berrangé wrote:
On Fri, Jun 06, 2025 at 11:50:10AM -0400, Steven Sistare wrote:
On 6/6/2025 11:43 AM, Daniel P. Berrangé wrote:
On Fri, Jun 06, 2025 at 10:37:28AM -0500, JAEHOON KIM wrote:
On 6/6/2025 10:12 AM, Steven Sistare wrote:
On 6/6/2025 11:06 AM, JAEHOON KIM wrote:
On 6/6/2025 9:14 AM, Steven Sistare wrote:
On 6/6/2025 9:53 AM, Daniel P. Berrangé wrote:
On Thu, Jun 05, 2025 at 06:08:08PM -0500, Jaehoon Kim wrote:
When the source VM attempts to connect to the destination VM's Unix domain socket (cpr.sock) during CPR transfer, the socket file might not yet exist if the destination side hasn't completed the bind operation. This can lead to connection failures when running tests with the qtest framework.
This sounds like a flawed test impl to me - whatever is initiating the cpr operation on the source has done so prematurely - it should ensure the dest is ready before starting the operation.
To address this, add cpr_validate_socket_path(), which waits for the socket file to appear. This avoids intermittent qtest failures caused by early connection attempts.
IMHO it is dubious to special case cpr in this way.
I agree with Daniel, and unfortunately it is not just a test issue; every management framework that supports cpr-transfer must add logic to wait for the cpr socket to appear in the target before proceeding. This is analogous to waiting for the monitor socket to appear before connecting to it.
- Steve
Thank you very much for your valuable review and feedback.
Just to clarify, the added cpr_validate_socket_path() function is not limited to the test framework. It is part of the actual CPR implementation and is intended to ensure correct behavior in all cases, including outside of tests. I mentioned the qtest failure simply as an example where this issue became apparent.
Yes, I understand that you understand :)
Are you willing to move your fix to the qtest?
- Steve
Thank you for your question and feedback.
I agree that the issue could be addressed within the qtest framework to improve stability. However, this socket readiness check is a fundamental part of the CPR transfer process, and I believe that incorporating cpr_validate_socket_path() directly into the CPR implementation helps ensure more reliable behavior across all environments - not only during testing.

Just from my perspective, adding this logic to the CPR code does not introduce significant overhead or side effects. I would appreciate it if you could share more details about your concerns, so I can better address them.
Requiring a busy wait like this is a sign of a design problem. There needs to be a way to set up the incoming socket listener without resorting to busy waiting - that's showing a lack of synchronization.
How is this a design problem? If I start a program that creates a listening unix domain socket, I cannot attempt to connect to it until the socket is created and listening. Clients face the same issue when starting qemu and connecting to the monitor socket.
Yes, the monitor has the same conceptual problem, and this caused problems for libvirt starting QEMU for many years.
With the busy wait you risk looping forever if the child (target) QEMU already exited for some reason without ever creating the socket. You can mitigate this by using 'kill($PID, 0)' in the loop and looking for -ESRCH, but this only works if you know the pid involved.
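The mitigated busy wait described above could be sketched roughly as follows. This is a standalone illustration; wait_for_unix_socket() is a hypothetical helper, not an existing QEMU or qtest function:

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Poll for a UNIX listener at @path, but bail out if process @pid is
 * already gone, so the loop cannot spin forever.  Returns a connected
 * fd on success, or a negative errno value on failure. */
static int wait_for_unix_socket(const char *path, pid_t pid)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };

    snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);

    for (;;) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0) {
            return -errno;
        }
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            return fd;                     /* listener is up */
        }
        close(fd);
        /* kill(pid, 0) checks for existence without delivering a
         * signal; ESRCH means the target exited, the socket will never
         * appear, and we should give up rather than loop forever. */
        if (kill(pid, 0) < 0 && errno == ESRCH) {
            return -ESRCH;
        }
        usleep(10 * 1000);                 /* brief backoff, then retry */
    }
}
```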
One option is to use 'daemonize' such that when the parent sees the initial QEMU process leader exit, the parent has a guarantee that the daemonized QEMU already has the UNIX listener open, and any failure indicates QEMU already quit.
The other option is to use FD passing such that QEMU is not responsible for opening the listener socket - it gets passed a pre-opened listener FD, so the parent has a guarantee it can successfully connect immediately and any failure indicates QEMU already quit.
For the tests, passing a pre-opened UNIX socket FD could work, but I'm still curious why this is only a problem for the CPR tests, and not the other migration tests which don't use 'defer'. What has made CPR special to expose a race?
For normal migration, target qemu listens on the migration socket, then listens on the monitor. After the client connects to the monitor (waiting for it to appear as needed), the migration socket already exists.

For cpr, target qemu creates the cpr socket and listens before the monitor is started, which is necessary because cpr state is needed before backends or devices are created.
A few months back I sent a series to start the monitor first (I think I called it the precreate phase), but it was derailed over discussions about allowing qemu to start with no arguments and be configured exclusively via the monitor.
- Steve
Thank you for sharing your thoughts. I agree that busy waiting is not ideal. However, considering the timing of when the target QEMU creates and begins listening on the socket, I think there is currently no reliable way for the host to check the socket's listening state. This also implies that FD passing is not a viable option in this case.
As for the 'defer' option in qtest, it doesn't cause race-condition issues because the target enters the listening state during the option processing.
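For context, the non-CPR 'defer' flow being contrasted here looks roughly like the following (command shapes only; the socket path is a placeholder):

```
# Target is started with the listener deferred:
#   qemu-system-x86_64 -incoming defer ...
#
# The test then tells the target to start listening via QMP; the reply
# arrives only after the listener is set up, so a subsequent 'migrate'
# on the source cannot race it:
{ "execute": "migrate-incoming",
  "arguments": { "uri": "unix:/tmp/migrate.sock" } }
```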
Of course, to address this issue, I could create a wait_for_socket() function similar to wait_for_serial() in the qtest framework. Since the socket might already be created, I cannot simply wait for the file to appear using file system notification APIs like inotify, so busy-waiting would still be necessary. I would appreciate hearing any further thoughts you might have on this.
The easiest solution, with no interface changes, is adding wait_for_socket() in qtest, with this addition from Daniel:

"With the busy wait you risk looping forever if the child (target) QEMU already exited for some reason without ever creating the socket. You can mitigate this by using 'kill($PID, 0)' in the loop and looking for -ESRCH, but this only works if you know the pid involved."
Daniel also suggested:

"For the tests, passing a pre-opened UNIX socket FD could work"

Note we cannot use any of the standard chardev options to specify such a socket, because the cpr socket is created before chardevs are created. Perhaps we could specify the fd in an extension of MigrationChannel's MigrationAddress.
{ 'union': 'MigrationAddress',
  'base': { 'transport' : 'MigrationAddressType'},
  'discriminator': 'transport',
  'data': {
    'socket': 'SocketAddress',
    'exec': 'MigrationExecCommand',
    'rdma': 'InetSocketAddress',
    'file': 'FileMigrationArgs',
    'fd': 'FdMigrationArgs' } }    <-- add this
That would be useful for all clients, but this is asking a lot from you when you are just trying to fix the tests.
- Steve
Thanks for the feedback and suggestions.
I'll update the patch to v2 accordingly.
- Jaehoon Kim