Hi Mark,

On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>> I am running into problems when sending large messages (about
>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>
> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
> set too low? (ulimit -l)

Good point. By now I have played with all kinds of ulimits (the nodes have
16GB of memory and 16GB of swap space - this program is not even coming
close to those limits). This is the current setting:

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

... same error :-(

>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error
>> polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id
>> 199132400 opcode 549755813 vendor error 105 qp_idx 3
>
> 105 looks like it might be an errno to me:
> #define ENOBUFS 105 /* No buffer space available */
>
> regards, mark.

BTW: when using Intel-MPI (MPICH2) the program segfaults with
l = 268435456 = 2^31/8, which makes me suspect that they use MPI_BYTE to
transfer the data internally and multiply the variable count by 8 without
checking whether the integer overflows ...

- Martin

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
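
To illustrate the overflow suspected above, here is a minimal sketch of the arithmetic only. It is not taken from Intel MPI or MPICH2; the helper names `byte_count_as_c_int` and `fits_in_c_int` are made up for this example, and the assumption is that an element count is multiplied by sizeof(double) = 8 and stored in a 32-bit signed int.

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value a 32-bit signed C int can hold


def byte_count_as_c_int(count, elem_size=8):
    """Return count * elem_size as a 32-bit signed int would store it.

    Hypothetical helper: it models the usual two's-complement wraparound.
    In C itself, signed overflow is formally undefined behaviour.
    """
    return ctypes.c_int32(count * elem_size).value


def fits_in_c_int(count, elem_size=8):
    """True if count * elem_size is representable as a 32-bit signed int."""
    return 0 <= count * elem_size <= INT_MAX


if __name__ == "__main__":
    # 180000000 doubles (the IB case), the last safe count, and 2^31/8.
    for count in (180_000_000, 2**31 // 8 - 1, 2**31 // 8):
        print(count, byte_count_as_c_int(count), fits_in_c_int(count))
```

Under these assumptions, 180000000 doubles (1440000000 bytes) still fit in an int, so the byte count alone would not explain the LOCAL LENGTH ERROR over IB, while a count of 2^31/8 is the first one whose byte count no longer fits.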
