Hi,

I have been trying to get a parallel (MPI) Fortran application to run solely
with hugepages, with limited success. The test machine is a 4-socket, 24-core
AMD Opteron machine. When I link the application against libhugetlbfs and run
more than 4-5 processes of the same executable (with enough hugepages
allocated), it usually fails in the check_range_empty function and
gives me this:

libhugetlbfs: WARNING: Unable to verify address range 0x2e3d000 -
0x3000000.  Not empty?

This is usually (not always) followed by a

pop: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char
*) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk,
fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned
long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *
(sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) &&
((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)'
failed.
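
For reference, the address range in the warning can be checked with a few
lines of arithmetic. This is only a sketch: it assumes 2 MiB huge pages (the
usual x86_64 default, not confirmed for this machine), and the helper name
align_down is made up for the illustration.

```python
# Assumption: 2 MiB huge pages; the report does not state the page size.
HUGE_PAGE_SIZE = 2 * 1024 * 1024

# Address range copied from the libhugetlbfs warning.
start, end = 0x2E3D000, 0x3000000


def align_down(addr, size):
    """Round addr down to a multiple of size (size must be a power of two)."""
    return addr & ~(size - 1)


range_len = end - start
print(f"range length: {range_len} bytes ({range_len // 1024} KiB)")
print(f"start is huge-page aligned: {start % HUGE_PAGE_SIZE == 0}")
print(f"end is huge-page aligned:   {end % HUGE_PAGE_SIZE == 0}")
print(f"huge page containing start: {align_down(start, HUGE_PAGE_SIZE):#x}")
```

Under that assumption the range is 1804 KiB long, starts inside the huge page
at 0x2e00000 and ends exactly on the next huge page boundary. That would be
consistent with the tail of a segment being rounded up to a huge page
boundary, but that interpretation is a guess.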


In the following, I try to run the parallel application with 16 processes.
Two of them fail in the setup phase of libhugetlbfs, and the whole
application stalls because of these failed processes.

$ /DOCS/opt/hugetlbfs/bin/hugectl --library-path /DOCS/opt/hugetlbfs/lib64
--heap --text --bss --data --force-preload mpirun -n 16 ./pop
libhugetlbfs: WARNING: Unable to verify address range 0x2e3d000 -
0x3000000.  Not empty?
libhugetlbfs: WARNING: Unable to verify address range 0x2e3d000 -
0x3000000.  Not empty?
pop: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char
*) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk,
fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned
long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *
(sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) &&
((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)'
failed.
pop: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char
*) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk,
fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned
long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *
(sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) &&
((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)'
failed.
^CCtrl-C caught... cleaning up processes
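
One way to double-check that the hugepage pool is not exhausted during such a
run is to compare the hugepage counters in /proc/meminfo before and while the
processes start. The following is a minimal sketch, not part of libhugetlbfs:
the function names are made up, and the SAMPLE values are invented for
illustration, not taken from the test machine.

```python
def parse_hugepage_counters(meminfo_text):
    """Extract the hugepage-related counters from /proc/meminfo-style text."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        if sep and (key.startswith("HugePages_") or key == "Hugepagesize"):
            # First token after the colon is the number; a unit (kB) may follow.
            counters[key] = int(rest.split()[0])
    return counters


def read_hugepage_counters(path="/proc/meminfo"):
    """Read the counters from a live Linux system."""
    with open(path) as f:
        return parse_hugepage_counters(f.read())


# Invented sample input, only to show the expected format.
SAMPLE = """\
MemTotal:       65961508 kB
HugePages_Total:     512
HugePages_Free:      500
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

counters = parse_hugepage_counters(SAMPLE)
free_kib = counters["HugePages_Free"] * counters["Hugepagesize"]
print(f"free hugepage memory: {free_kib} kB")
```

On the machine itself, calling read_hugepage_counters() before and during the
mpirun invocation would show whether HugePages_Free drops to zero around the
time the failures appear.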


The same procedure works fine with a C application. I can reproduce this on
a different machine with different host/setup/kernel. Different MPI
implementations don't change the behavior.

Could this be an issue with libgfortran?

~Hakan
_______________________________________________
Libhugetlbfs-devel mailing list
Libhugetlbfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
