Let me try to understand this test:

- you're simulating a 1GB memory limit via ulimit of virtual memory ("ulimit -v $((1*1024*1024))"), i.e. 1,048,576 KB (ulimit -v counts in kilobytes), or 1,073,741,824 bytes
- you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI app
- OMPI is barfing in the ptmalloc allocator
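
As a sanity check on the sizes above, here is a rough sketch of the arithmetic in Python. It assumes bash's `ulimit -v` counts in 1024-byte units; the helper names are made up for illustration, and the 8560001024 figure is the mmap length from the strace output quoted further down.

```python
KIB = 1024  # assumption: bash's `ulimit -v` argument is in 1024-byte units


def ulimit_v_to_bytes(ulimit_kib: int) -> int:
    """Convert a `ulimit -v` value (in KiB) to bytes."""
    return ulimit_kib * KIB


def exceeds_limit(request_bytes: int, ulimit_kib: int) -> bool:
    """True if a single request is larger than the whole address-space limit."""
    return request_bytes > ulimit_v_to_bytes(ulimit_kib)


if __name__ == "__main__":
    limit_kib = 1 * 1024 * 1024            # value passed to `ulimit -v`
    limit_bytes = ulimit_v_to_bytes(limit_kib)

    sizes = {
        "size quoted in this mail": 1070 * 10**6,
        "mmap length seen in strace": 8560001024,
    }

    print(f"limit: {limit_bytes:,} bytes")
    for label, size in sizes.items():
        ratio = size / limit_bytes
        print(f"{label}: {size:,} bytes, {ratio:.2f}x the limit, "
              f"exceeds={exceeds_limit(size, limit_kib)}")
```

Note that the limit covers the whole address space of the process, so a request that is only slightly under the limit can still fail once the binary, libraries, and stacks are mapped.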

Meaning: you're trying to allocate more memory than you're allowing in virtual memory (the mmap in your strace output below asks for 8,560,001,024 bytes, roughly 8x a 1GB limit) -- so I guess part of this test depends on how much physical RAM you have, because you're limiting virtual memory, right?

It's quite possible that the ptmalloc included in OMPI doesn't guard well against a failed mmap. FWIW, I've seen all kinds of random badness (not just with OMPI) when malloc/mmap/etc. start failing due to lack of memory.

Do you get the same behavior if you disable ptmalloc in OMPI? (your IB large message bandwidth will suffer a bit, though)


On Aug 29, 2013, at 12:01 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 28/08/13 19:36, Chris Samuel wrote:
>
>> With RHEL 6.4 gfortran it instead SEGV's straight away
>
> Using strace I can see a mmap(2) (called from malloc I presume)
> failing just before the SEGV.
>
> Process 6799 detached
> Process 6798 detached
> Hello, world, I am 0 of 1
> [pid 6796] mmap(NULL, 8560001024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> [pid 6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [barcoo:06796] *** Process received signal ***
> [barcoo:06796] Signal: Segmentation fault (11)
> [barcoo:06796] Signal code: Address not mapped (1)
> [barcoo:06796] Failing at address: 0x20078d708
> [pid 6796] mmap(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f75a5fed000
> [barcoo:06796] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
> [barcoo:06796] [ 1] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982) [0x7f77a68c2dd2]
> [barcoo:06796] [ 2] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) [0x7f77a68c3f42]
> [barcoo:06796] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
> [barcoo:06796] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
> [barcoo:06796] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
> [barcoo:06796] [ 6] ./gnumyhello_f90() [0x400d69]
> [barcoo:06796] *** End of error message ***
> [pid 6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [pid 6796] +++ killed by SIGSEGV (core dumped) +++
>
>
> The SEGV occurs (according to the gdb core dump I have) at the
> second set_head() call in this code:
>
>   /* check that one of the above allocation paths succeeded */
>   if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
>     remainder_size = size - nb;
>     remainder = chunk_at_offset(p, nb);
>     av->top = remainder;
>     set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
>     set_head(remainder, remainder_size | PREV_INUSE);
>     check_malloced_chunk(av, p, nb);
>     return chunk2mem(p);
>   }
>
>
> The arguments to that function are:
>
> (gdb) print remainder
> $1 = (struct malloc_chunk *) 0x2008e5700
>
> (gdb) print remainder_size
> $2 = 0
>
> Any ideas?
>
> cheers,
> Chris
> - --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlIex30ACgkQO2KABBYQAh8HmQCgjj7tReOfdubczho7x9poprM7
> 5CwAnRBlw2LHrVHQsu2M1W6qo2H2HOzb
> =dasp
> -----END PGP SIGNATURE-----
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/