Hello,

While simulating a large fabric with ibsim (roughly 3000 lines of topology, 50 36-port switches, 576 HCAs), I get errors like the following:

sim_read_pkt: write failed: Resource temporarily unavailable - pkt dropped

The code is as follows (ibsim.c, function sim_read_pkt()):

       // reply
       ret = write(dcl->fd, buf, size);
       if (ret == size)
           return 0;

       if (ret < 0 && (errno == ECONNREFUSED || errno == ENOTCONN)) {
           IBWARN("client %u seems to be dead - disconnecting.",
                  dcl->id);
           disconnect_client(dcl->id);
       }
       IBWARN("write failed: %m - pkt dropped");

The errno reported here is EAGAIN, which this code does not handle at all: the reply is simply dropped.
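For context, write() normally fails with EAGAIN only on a non-blocking descriptor whose send queue is full, i.e. when the peer is not reading fast enough. Below is a minimal sketch that reproduces the condition outside of ibsim. The helper names (set_nonblocking, packets_until_eagain) are made up for illustration, and it assumes a Linux-style datagram socket whose reader never drains it:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper (not part of ibsim): write pkt_size-byte packets
 * to fd until write() fails. Returns the number of packets accepted if
 * the failure was EAGAIN/EWOULDBLOCK, or -1 for any other error or if
 * all max_packets writes succeeded.
 */
int packets_until_eagain(int fd, size_t pkt_size, int max_packets)
{
    char buf[512];
    int sent;

    assert(pkt_size > 0 && pkt_size <= sizeof(buf));
    memset(buf, 0, sizeof(buf));

    for (sent = 0; sent < max_packets; sent++) {
        ssize_t ret = write(fd, buf, pkt_size);

        if (ret < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? sent : -1;
    }
    return -1;
}
```

Calling packets_until_eagain() on one end of a non-blocking socketpair(AF_UNIX, SOCK_DGRAM, ...) whose other end is never read should return a small positive count. The exact number depends on the kernel's queue limits, so I would not rely on any particular value.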

When I kill opensm after seeing these errors, I see that the MADs were not acknowledged by ibsim, e.g.:

OpenSM: Got signal 2 - exiting...
There are still 51 MADs out. Forcing the exit of the OpenSM application...

To address this issue, I modified the code as follows:

--- ibsim.c.ORIG    2008-09-18 14:30:07.000000000 +0200
+++ ibsim.c    2008-09-18 15:37:55.000000000 +0200
@@ -481,6 +481,8 @@
        return -1;
    }
    for (;;) {
+        int retry_count = 0;
+
        if ((size = read(fd, buf, sizeof(buf))) <= 0)
            return size;

@@ -497,7 +499,14 @@
             size, sizeof(struct sim_request), dcl->id, dcl->fd);

        // reply
-        ret = write(dcl->fd, buf, size);
+        do {
+            ret = write(dcl->fd, buf, size);
+            if (retry_count && (ret != size)) {
+                IBWARN("failed to send reply: ret = %d, retry_count =%d, errno = %d.",
+                    ret, retry_count, errno);
+            }
+        } while ((retry_count++ < 20) && (ret == -1));
+
         if (ret == size)
            return 0;

Basically, it retries the write up to 20 times before giving up (and I still get errors, although fewer).
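Since the retry loop above spins without waiting, a possible alternative is to wait for the descriptor to become writable before retrying. The following is only a sketch under assumptions, not tested against ibsim: write_pkt_wait, max_retries and timeout_ms are names invented here, and it treats a short write as an error, which fits datagram-style packets but not a stream socket.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of ibsim: send one packet on a
 * (possibly non-blocking) descriptor. On EAGAIN, wait up to timeout_ms
 * for POLLOUT before trying again, for at most max_retries retries.
 * Returns 0 on success, -1 on failure with errno set.
 */
int write_pkt_wait(int fd, const void *buf, size_t size,
                   int max_retries, int timeout_ms)
{
    int attempt;

    assert(buf != NULL && max_retries >= 0);

    for (attempt = 0; attempt <= max_retries; attempt++) {
        struct pollfd pfd;
        ssize_t ret = write(fd, buf, size);

        if (ret == (ssize_t)size)
            return 0;
        if (ret >= 0) {
            /* Short write: reported as an error in this sketch. */
            errno = EIO;
            return -1;
        }
        if (errno == EINTR)
            continue;   /* interrupted: retry (counts as an attempt) */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;  /* real error: let the caller decide */

        /* Queue full: sleep until writable or until the timeout. */
        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        if (poll(&pfd, 1, timeout_ms) < 0 && errno != EINTR)
            return -1;
    }
    errno = EAGAIN;     /* retries exhausted while the queue stayed full */
    return -1;
}
```

In sim_read_pkt() this would stand in for the bare write(), e.g. write_pkt_wait(dcl->fd, buf, size, 20, 100), with the existing ECONNREFUSED/ENOTCONN handling kept for the failure path. Whether blocking the simulator loop for that long is acceptable is a separate question.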

The question is: am I looking at the right thing here, or is the 'pkt dropped' error hiding another problem elsewhere?
Note: both the ibsim and opensm sources were pulled from the head of the git tree.

Thanks for your help,

Vincent