Be careful - ulimits can differ between an interactive shell launched with
rsh/ssh, an interactive batch shell launched with "qsub -I" and the like, the
environment of your batch script, and the environment of the processes launched
via mpirun. I've been burned by this before.
If you are using a TM-based launch, for example (Open MPI or OSU mpiexec), the
ulimit environment on a PBS/Torque batch setup will be governed by the ulimits
of pbs_mom, which in turn are governed by your init process and/or by any
ulimit commands in init.d/pbs-client.
The only way to be sure of a particular ulimit is to do a getrlimit() call in
your MPI-launched binary and check the value.
Chances are this isn't your problem, though, because usually the error messages
make it pretty clear that a memory lock failure has occurred.
Don Holmgren
Fermilab
On Mon, 16 Nov 2009, Martin Siegert wrote:
Hi Mark,
On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
I am running into problems when sending large messages (about
180000000 doubles) over IB. A fairly trivial example program is attached.
sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
set too low? (ulimit -l)
Good point.
By now I have played with all kinds of ulimits (the nodes have 16GB
of memory and 16GB of swap space - this program is not even coming close
to those limits). This is the current setting:
# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 139264
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 139264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
... same error :-(
[[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error
polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id
199132400 opcode 549755813 vendor error 105 qp_idx 3
105 looks like it might be an errno to me:
#define ENOBUFS 105 /* No buffer space available */
regards, mark.
BTW: when using Intel-MPI (MPICH2) the program segfaults with
l = 268435456 = 2^31/8, which makes me suspect that they use MPI_BYTE to
transfer the data internally and multiply the count variable by 8
without checking whether the integer overflows ...
- Martin
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf