Hey folks

MTT is reporting a massive wave of hangs on master from last night - they all 
look like this:

libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'usnic_verbs': libusnic_verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-1001@mpi012_0/41788/1/shared_mem_pool.mpi012
(the shared-memory backing file). It is likely that your MPI job will now
either abort or experience performance degradation.

  Local host:  mpi012
  Space Requested: 134217736 B
  Space Available: 22097920 B
--------------------------------------------------------------------------
[mpi012:28580] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728

[warn] Epoll ADD(4) on fd 81 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 71 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 90 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 82 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 80 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 68 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 79 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 85 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 87 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 83 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 88 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 86 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 77 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 74 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 84 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 89 failed.  Old events were 0; read change was 0 (none); write change was


I realize that this indicates a problem on the Cisco MTT cluster (the node
is short on /tmp space), but we should handle it better than just polling
endlessly until the timeout, yes?
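
To make the "handle it better" idea concrete, here is a minimal sketch of a
fail-fast space check, written in Python for brevity rather than in our C
code. It is not the actual shared memory BTL logic: the names
available_bytes, check_backing_space, and InsufficientSpaceError are
hypothetical, and it assumes a POSIX system where os.statvfs is available.
The point is only that a failed check raises an error the caller can abort
on, instead of letting the job carry on and poll.

```python
import os


class InsufficientSpaceError(RuntimeError):
    """Hypothetical error: the directory cannot hold the backing file."""


def available_bytes(directory):
    """Return the bytes available to an unprivileged user in `directory`."""
    st = os.statvfs(directory)  # POSIX only
    return st.f_bavail * st.f_frsize


def check_backing_space(directory, requested, available=None):
    """Raise immediately if `directory` cannot hold `requested` bytes.

    `available` may be supplied by the caller so the check can be
    exercised without depending on the state of a real filesystem.
    """
    if available is None:
        available = available_bytes(directory)
    if requested > available:
        raise InsufficientSpaceError(
            f"need {requested} B but only {available} B available "
            f"in {directory}"
        )
    return available
```

With the numbers from the report above (134217736 B requested, 22097920 B
available), check_backing_space raises InsufficientSpaceError, so a caller
could abort the job cleanly at that point.
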
Ralph
