Hey folks,

MTT is reporting a massive wave of hangs on master from last night. They all look like this:

libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'usnic_verbs': libusnic_verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
It appears as if there is not enough space for /tmp/openmpi-sessions-1001@mpi012_0/41788/1/shared_mem_pool.mpi012 (the shared-memory backing file). It is likely that your MPI job will now either abort or experience performance degradation.

  Local host:      mpi012
  Space Requested: 134217736 B
  Space Available: 22097920 B
--------------------------------------------------------------------------
[mpi012:28580] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[warn] Epoll ADD(4) on fd 81 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 71 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 90 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 82 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 80 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 68 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 79 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 85 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 87 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 83 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 88 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 86 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 77 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 74 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 84 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 89 failed. Old events were 0; read change was 0 (none); write change was

I realize that this indicates a problem on the Cisco MTT cluster, but we should handle it better than to just poll endlessly until timeout, yes?

Ralph
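
P.S. To make "handle it better" a bit more concrete, here is a rough sketch of the kind of up-front check I have in mind, so that a too-small backing filesystem becomes a hard error instead of a hang. It assumes POSIX statvfs(). The helper name backing_dir_has_space is made up for illustration and is not an existing Open MPI function, and where such a check would belong in the code is left open.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper (not an Open MPI function): report whether the
 * filesystem holding `dir` has at least `needed` bytes available to an
 * unprivileged process.
 *
 * Returns 1 if there is enough space, 0 if there is not,
 * and -1 if statvfs() itself fails (e.g. the directory does not exist).
 */
static int backing_dir_has_space(const char *dir, uint64_t needed)
{
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        perror("statvfs");
        return -1;
    }

    /* f_bavail counts blocks of f_frsize bytes available to non-root users. */
    uint64_t avail = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;

    return avail >= needed ? 1 : 0;
}
```

With the numbers from the log above (134217736 B requested, 22097920 B available), a check like this would return 0, and the caller could abort the job with a clear error right there rather than carrying on and polling until the timeout.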