On 6/6/2025 11:06 AM, JAEHOON KIM wrote:
On 6/6/2025 9:14 AM, Steven Sistare wrote:
On 6/6/2025 9:53 AM, Daniel P. Berrangé wrote:
On Thu, Jun 05, 2025 at 06:08:08PM -0500, Jaehoon Kim wrote:
When the source VM attempts to connect to the destination VM's Unix
domain socket(cpr.sock) during CPR transfer, the socket file might not
yet be exist if the destination side hasn't completed the bind
operation. This can lead to connection failures when running tests with
the qtest framework.

This sounds like a flawed test impl to me - whatever is initiating
the cpr operation on the source has done so prematurely - it should
ensure the dest is ready before starting the operation.

To address this, add cpr_validate_socket_path(), which wait for the
socket file to appear. This avoids intermittent qtest failures caused by
early connection attempts.

IMHO it is dubious to special case cpr in this way.

I agree with Daniel, and unfortunately it is not just a test issue;
every management framework that supports cpr-transfer must add logic to
wait for the cpr socket to appear in the target before proceeding.

This is analogous to waiting for the monitor socket to appear before
connecting to it.

- Steve

Thank you very much for your valuable review and feedback.

Just to clarify, the added cpr_validate_socket_path() function is not limited 
to the test framework.
It is part of the actual CPR implementation and is intended to ensure correct 
behavior in all cases, including outside of tests.

I mentioned the qtest failure simply as an example where this issue became 
apparent.

Yes, I understand that you understand :)
Are you willing to move your fix to the qtest?

- Steve

Signed-off-by: Jaehoon Kim <jh...@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjhe...@linux.ibm.com>
---
  include/migration/cpr.h  |  1 +
  migration/cpr-transfer.c | 35 +++++++++++++++++++++++++++++++++++
  2 files changed, 36 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 7561fc75ad..cc9384b4f9 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -23,6 +23,7 @@ MigMode cpr_get_incoming_mode(void);
  void cpr_set_incoming_mode(MigMode mode);
  bool cpr_is_incoming(void);
  +bool cpr_validate_socket_path(const char *path, Error **errp);
  int cpr_state_save(MigrationChannel *channel, Error **errp);
  int cpr_state_load(MigrationChannel *channel, Error **errp);
  void cpr_state_close(void);
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index e1f140359c..3088ed323f 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -17,6 +17,33 @@
  #include "migration/vmstate.h"
  #include "trace.h"
  +#define CPR_MAX_RETRIES     50     /* Retry for up to 5 seconds */
+#define CPR_RETRY_DELAY_US  100000 /* 100 ms per retry */
+
+bool cpr_validate_socket_path(const char *path, Error **errp)
+{
+    struct stat st;
+    int retries = CPR_MAX_RETRIES;
+
+    do {
+        if (!stat(path, &st) && S_ISSOCK(st.st_mode)) {
+            return true;
+        }
+
+        if (errno == ENOENT) {
+            usleep(CPR_RETRY_DELAY_US);
+        } else {
+            error_setg_errno(errp, errno,
+                "Unable to check status of socket path '%s'", path);
+            return false;
+        }
+    } while (--retries > 0);
+
+    error_setg(errp, "Socket path '%s' not found after %d retries",
+                                            path, CPR_MAX_RETRIES);
+    return false;
+}
+
  QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
  {
      MigrationAddress *addr = channel->addr;
@@ -28,6 +55,14 @@ QEMUFile *cpr_transfer_output(MigrationChannel *channel, 
Error **errp)
          QIOChannel *ioc = QIO_CHANNEL(sioc);
          SocketAddress *saddr = &addr->u.socket;
  +        /*
+         * Verify that the cpr.sock Unix domain socket file exists and is ready
+         * before proceeding with the connection.
+         */
+        if (!cpr_validate_socket_path(addr->u.socket.u.q_unix.path, errp)) {
+            return NULL;
+        }
+
          if (qio_channel_socket_connect_sync(sioc, saddr, errp) < 0) {
              return NULL;
          }
--
2.49.0



With regards,
Daniel




Reply via email to