Let me try to understand this test: 

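- you're simulating a 1GB memory limit via ulimit of virtual memory ("ulimit -v 
$((1*1024*1024))"); ulimit -v counts in 1024-byte units, so that's 1,048,576 KB, 
or 1,073,741,824 bytes.
- you're trying to allocate 1070*10^6 = 1,070,000,000 bytes in an MPI app
- OMPI is barfing in the ptmalloc allocator

Meaning: that one allocation is about as big as the entire virtual memory 
limit, so it can't fit alongside everything else the process already has 
mapped (and the strace below shows an mmap request for 8,560,001,024 bytes, 
which is far over the limit).  Since you're limiting virtual memory rather than 
physical RAM, I wouldn't expect the amount of RAM you have to matter here, 
right?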
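
To double-check the numbers, here's a small C sketch (mine, not anything from 
OMPI; the function names are made up for illustration).  It assumes bash's 
documented behavior that ulimit -v takes 1024-byte units, and it uses 
getrlimit(RLIMIT_AS), which reports the address-space limit in bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/resource.h>

/* Convert a `ulimit -v` argument (1024-byte units in bash) to bytes. */
uint64_t ulimit_v_to_bytes(uint64_t kib)
{
    assert(kib <= UINT64_MAX / 1024u);   /* guard against overflow */
    return kib * 1024u;
}

/* Return 1 if `request` bytes fit under `limit` bytes when `in_use` bytes
 * of address space are already taken, 0 otherwise. */
int request_fits(uint64_t limit, uint64_t in_use, uint64_t request)
{
    if (in_use > limit)
        return 0;
    return request <= limit - in_use;
}

/* Print the address-space limit the current process actually sees. */
void print_as_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        perror("getrlimit");
        return;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_AS: unlimited\n");
    else
        printf("RLIMIT_AS: %llu bytes\n", (unsigned long long)rl.rlim_cur);
}
```

Calling print_as_limit() at the top of the test program, from the same shell 
where the ulimit was set, would show the limit the process really runs under.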

It's quite possible that the ptmalloc included in OMPI doesn't guard well 
against a failed mmap.  FWIW, I've seen all kinds of random badness (not just 
with OMPI) when malloc/mmap/etc. start failing due to lack of memory.
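
By "guard against a failed mmap" I mean something like the sketch below.  To 
be clear, this is not the ptmalloc code and I'm not claiming this is where the 
bug is; checked_anon_map() and checked_anon_unmap() are hypothetical helpers 
that just show the check.  mmap reports failure as MAP_FAILED (not NULL), so a 
caller that skips the check ends up doing arithmetic on a bogus pointer.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes of anonymous memory.
 * Returns NULL (rather than MAP_FAILED) if the kernel refuses. */
void *checked_anon_map(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        /* e.g. ENOMEM when the request exceeds the ulimit -v setting */
        fprintf(stderr, "mmap(%zu) failed: %s\n", len, strerror(errno));
        return NULL;
    }
    return p;
}

/* Hypothetical helper: release a mapping made by checked_anon_map(). */
void checked_anon_unmap(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

The caller then has to check for NULL and bail out cleanly before touching any 
chunk headers, rather than carrying on with whatever pointer it has.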

Do you get the same behavior if you disable ptmalloc in OMPI?  (Your InfiniBand 
large-message bandwidth will suffer a bit, though.)



On Aug 29, 2013, at 12:01 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 28/08/13 19:36, Chris Samuel wrote:
> 
>> With RHEL 6.4 gfortran it instead SEGV's straight away
> 
> Using strace I can see a mmap(2) (called from malloc I presume)
> failing just before the SEGV.
> 
> Process 6799 detached
> Process 6798 detached
> Hello, world, I am            0  of            1
> [pid  6796] mmap(NULL, 8560001024, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> [pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [barcoo:06796] *** Process received signal ***
> [barcoo:06796] Signal: Segmentation fault (11)
> [barcoo:06796] Signal code: Address not mapped (1)
> [barcoo:06796] Failing at address: 0x20078d708
> [pid  6796] mmap(NULL, 2097152, PROT_NONE, 
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f75a5fed000
> [barcoo:06796] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
> [barcoo:06796] [ 1] 
> /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982)
>  [0x7f77a68c2dd2]
> [barcoo:06796] [ 2] 
> /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) 
> [0x7f77a68c3f42]
> [barcoo:06796] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
> [barcoo:06796] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
> [barcoo:06796] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
> [barcoo:06796] [ 6] ./gnumyhello_f90() [0x400d69]
> [barcoo:06796] *** End of error message ***
> [pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [pid  6796] +++ killed by SIGSEGV (core dumped) +++
> 
> 
> The SEGV occurs (according to the gdb core dump I have) at the
> second set_head() call in this code:
> 
>  /* check that one of the above allocation paths succeeded */
>  if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
>    remainder_size = size - nb;
>    remainder = chunk_at_offset(p, nb);
>    av->top = remainder;
>    set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
>    set_head(remainder, remainder_size | PREV_INUSE);
>    check_malloced_chunk(av, p, nb);
>    return chunk2mem(p);
>  }
> 
> 
> The arguments to that function are:
> 
> (gdb) print remainder
> $1 = (struct malloc_chunk *) 0x2008e5700
> 
> (gdb) print remainder_size
> $2 = 0
> 
> Any ideas?
> 
> cheers,
> Chris
> - -- 
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> 
> iEYEARECAAYFAlIex30ACgkQO2KABBYQAh8HmQCgjj7tReOfdubczho7x9poprM7
> 5CwAnRBlw2LHrVHQsu2M1W6qo2H2HOzb
> =dasp
> -----END PGP SIGNATURE-----
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
