Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-29 Thread Christopher Samuel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 28/08/13 19:36, Chris Samuel wrote:

> With RHEL 6.4 gfortran it instead SEGV's straight away

Using strace I can see a mmap(2) (called from malloc I presume)
failing just before the SEGV.

Process 6799 detached
Process 6798 detached
 Hello, world, I am0  of1
[pid  6796] mmap(NULL, 8560001024, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
[pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[barcoo:06796] *** Process received signal ***
[barcoo:06796] Signal: Segmentation fault (11)
[barcoo:06796] Signal code: Address not mapped (1)
[barcoo:06796] Failing at address: 0x20078d708
[pid  6796] mmap(NULL, 2097152, PROT_NONE, 
MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f75a5fed000
[barcoo:06796] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
[barcoo:06796] [ 1] 
/usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982)
 [0x7f77a68c2dd2]
[barcoo:06796] [ 2] 
/usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) 
[0x7f77a68c3f42]
[barcoo:06796] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
[barcoo:06796] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
[barcoo:06796] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
[barcoo:06796] [ 6] ./gnumyhello_f90() [0x400d69]
[barcoo:06796] *** End of error message ***
[pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid  6796] +++ killed by SIGSEGV (core dumped) +++


The SEGV occurs (according to the gdb core dump I have) at the
second set_head() call in this code:

  /* check that one of the above allocation paths succeeded */
  if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
remainder_size = size - nb;
remainder = chunk_at_offset(p, nb);
av->top = remainder;
set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
set_head(remainder, remainder_size | PREV_INUSE);
check_malloced_chunk(av, p, nb);
return chunk2mem(p);
  }


The arguments to that second set_head() call are:

(gdb) print remainder
$1 = (struct malloc_chunk *) 0x2008e5700

(gdb) print remainder_size
$2 = 0
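
For context -- and this is just my simplified reading of the ptmalloc2
source, not the verbatim code -- set_head() and chunk_at_offset() boil down
to the following, so that second set_head() is a plain store through
'remainder', and it faults if that address was never actually mapped:

  /* Simplified sketch of the macros involved (not the verbatim source;
     the real malloc_chunk has more fields) */
  #include <stddef.h>

  typedef struct malloc_chunk { size_t prev_size; size_t size; } *mchunkptr;

  #define PREV_INUSE 0x1
  #define chunk_at_offset(p, s)  ((mchunkptr) (((char *) (p)) + (s)))
  #define set_head(p, s)         ((p)->size = (s))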

Any ideas?

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlIex30ACgkQO2KABBYQAh8HmQCgjj7tReOfdubczho7x9poprM7
5CwAnRBlw2LHrVHQsu2M1W6qo2H2HOzb
=dasp
-END PGP SIGNATURE-


[OMPI devel] Compilation error with OpenIPMI in Ubuntu 12.04

2013-08-29 Thread Rishi Kaundinya Mutnuru
Hi,
I have downloaded OpenIPMI-2.0.20-rc1 and am trying to use it to develop a 
hardware monitoring tool.
I am facing compilation issues on an Ubuntu 12.04 host; I am snipping the 
error below along with the system details.
I would highly appreciate a quick response, as this is critical for our work.



$uname -a
Linux 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 
x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 12.04.3 LTS
Release:12.04
Codename:   precise

jab@nunez-jab:~/OpenIPMI/OpenIPMI-2.0.20-rc1$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc-4.6.real
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 
4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs 
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr 
--program-suffix=-4.6 --enable-shared --enable-linker-build-id 
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext 
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object 
--enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 
--with-tune=generic --enable-checking=release --build=x86_64-linux-gnu 
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

...

..
bin/bash ../libtool --tag=CC   --mode=link gcc -Wall -Wsign-compare 
-I../include -DIPMI_CHECK_LOCKS  -g -O2 -rdynamic ../unix/libOpenIPMIposix.la  
-o ipmi_sim ipmi_sim.o emu.o emu_cmd.o -lpopt libIPMIlanserv.la
libtool: link: gcc -Wall -Wsign-compare -I../include -DIPMI_CHECK_LOCKS -g -O2 
-rdynamic -o .libs/ipmi_sim ipmi_sim.o emu.o emu_cmd.o  
../unix/.libs/libOpenIPMIposix.so /usr/lib/x86_64-linux-gnu/libpopt.so 
./.libs/libIPMIlanserv.so -Wl,-rpath -Wl,/home/jab/OpenIPMI/opt/lib
ipmi_sim.o: In function `sleeper':
/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:968: undefined 
reference to `os_handler_alloc_waiter'
/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:974: undefined 
reference to `os_handler_waiter_wait'
/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:975: undefined 
reference to `os_handler_waiter_release'
ipmi_sim.o: In function `main':
/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:1190: undefined 
reference to `os_handler_alloc_waiter_factory'
collect2: ld returned 1 exit status
make[3]: *** [ipmi_sim] Error 1
make[3]: Leaving directory `/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1'
make: *** [all] Error 2

Thanks,
Rishi




Re: [OMPI devel] Compilation error with OpenIPMI in Ubuntu 12.04

2013-08-29 Thread George Bosilca
Wrong mailing list; this one is for the development of Open MPI. For OpenIPMI 
you should use openipmi-develo...@lists.sourceforge.net.

  George.


On Aug 29, 2013, at 08:16 , Rishi Kaundinya Mutnuru  
wrote:

> Hi,
> I have downladed OpenIPMI-2.0.20-rc1 and tried using for developing a 
> hardware monitoring tool.
> I am facing
> compilation issues on ubuntu 12.04 host. I am snipping the error below with 
> the system details.
> Highly appreciate your quick response as this critical for our work.
>  
>  
> 
> $uname -a
> Linux 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 
> x86_64 x86_64 GNU/Linux
>  
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 12.04.3 LTS
> Release:12.04
> Codename:   precise
>  
> jab@nunez-jab:~/OpenIPMI/OpenIPMI-2.0.20-rc1$ gcc -v
> Using built-in specs.
> COLLECT_GCC=/usr/bin/gcc-4.6.real
> COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 
> 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs 
> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr 
> --program-suffix=-4.6 --enable-shared --enable-linker-build-id 
> --with-system-zlib --libexecdir=/usr/lib --without-included-gettext 
> --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 
> --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
> --enable-libstdcxx-debug --enable-libstdcxx-time=yes 
> --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror 
> --with-arch-32=i686 --with-tune=generic --enable-checking=release 
> --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
> Thread model: posix
> gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>  
> …
> ….
> ..
> bin/bash ../libtool --tag=CC   --mode=link gcc -Wall -Wsign-compare 
> -I../include -DIPMI_CHECK_LOCKS  -g -O2 -rdynamic ../unix/libOpenIPMIposix.la 
>  -o ipmi_sim ipmi_sim.o emu.o emu_cmd.o -lpopt libIPMIlanserv.la
> libtool: link: gcc -Wall -Wsign-compare -I../include -DIPMI_CHECK_LOCKS -g 
> -O2 -rdynamic -o .libs/ipmi_sim ipmi_sim.o emu.o emu_cmd.o  
> ../unix/.libs/libOpenIPMIposix.so /usr/lib/x86_64-linux-gnu/libpopt.so 
> ./.libs/libIPMIlanserv.so -Wl,-rpath -Wl,/home/jab/OpenIPMI/opt/lib
> ipmi_sim.o: In function `sleeper':
> /home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:968: undefined 
> reference to `os_handler_alloc_waiter'
> /home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:974: undefined 
> reference to `os_handler_waiter_wait'
> /home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:975: undefined 
> reference to `os_handler_waiter_release'
> ipmi_sim.o: In function `main':
> /home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv/ipmi_sim.c:1190: undefined 
> reference to `os_handler_alloc_waiter_factory'
> collect2: ld returned 1 exit status
> make[3]: *** [ipmi_sim] Error 1
> make[3]: Leaving directory `/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1/lanserv'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/jab/OpenIPMI/OpenIPMI-2.0.20-rc1'
> make: *** [all] Error 2
>  
> Thanks,
> Rishi
>  
>  
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-29 Thread Ralph Castain
I would guess the problem is that your memory restriction is causing a malloc 
failure based on this line:

> [pid  6796] mmap(NULL, 8560001024, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

and we probably don't protect against that failure as well as we should. I 
doubt we would issue another 1.6 release for it, though.
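
Something along these lines is the sort of guard I mean -- a rough sketch
only, not how the ptmalloc2 code in our tree is actually structured:

  /* Rough sketch of the kind of guard meant above -- not the actual
     ptmalloc2 code, which is organized quite differently. */
  #include <stddef.h>
  #include <sys/mman.h>

  static void *grow_heap_or_fail(size_t size)
  {
      void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED)
          return NULL;  /* let malloc() return NULL instead of carving a
                           chunk out of an address that was never mapped */
      return mem;
  }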


On Aug 28, 2013, at 9:01 PM, Christopher Samuel  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 28/08/13 19:36, Chris Samuel wrote:
> 
>> With RHEL 6.4 gfortran it instead SEGV's straight away
> 
> Using strace I can see a mmap(2) (called from malloc I presume)
> failing just before the SEGV.
> 
> Process 6799 detached
> Process 6798 detached
> Hello, world, I am0  of1
> [pid  6796] mmap(NULL, 8560001024, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> [pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [barcoo:06796] *** Process received signal ***
> [barcoo:06796] Signal: Segmentation fault (11)
> [barcoo:06796] Signal code: Address not mapped (1)
> [barcoo:06796] Failing at address: 0x20078d708
> [pid  6796] mmap(NULL, 2097152, PROT_NONE, 
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f75a5fed000
> [barcoo:06796] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
> [barcoo:06796] [ 1] 
> /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982)
>  [0x7f77a68c2dd2]
> [barcoo:06796] [ 2] 
> /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) 
> [0x7f77a68c3f42]
> [barcoo:06796] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
> [barcoo:06796] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
> [barcoo:06796] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
> [barcoo:06796] [ 6] ./gnumyhello_f90() [0x400d69]
> [barcoo:06796] *** End of error message ***
> [pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [pid  6796] +++ killed by SIGSEGV (core dumped) +++
> 
> 
> The SEGV occurs (according to the gdb core dump I have) at the
> second set_head() call in this code:
> 
>  /* check that one of the above allocation paths succeeded */
>  if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
>remainder_size = size - nb;
>remainder = chunk_at_offset(p, nb);
>av->top = remainder;
>set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
>set_head(remainder, remainder_size | PREV_INUSE);
>check_malloced_chunk(av, p, nb);
>return chunk2mem(p);
>  }
> 
> 
> The arguments to that function are:
> 
> (gdb) print remainder
> $1 = (struct malloc_chunk *) 0x2008e5700
> 
> (gdb) print remainder_size
> $2 = 0
> 
> ANy ideas?
> 
> cheers,
> Chris
> - -- 
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> 
> iEYEARECAAYFAlIex30ACgkQO2KABBYQAh8HmQCgjj7tReOfdubczho7x9poprM7
> 5CwAnRBlw2LHrVHQsu2M1W6qo2H2HOzb
> =dasp
> -END PGP SIGNATURE-
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] NO LT_DLADVISE - CANNOT LOAD LIBOMPI JAVA BINDINGS

2013-08-29 Thread Bibrak Qamar
Hi all,

I have the following runtime error while running Java MPI jobs. I have
checked the previous answers on the mailing list regarding this issue.

The suggested solution was to install libtool and then configure, compile,
and install Open MPI again, this time with the latest versions of:

m4
autoconf
automake
libtool
and flex

I did all that, but I still get the same issue: it can't load the libraries.
Any remedies?



-bash-3.2$ mpirun -np 2 java Hello
[compute-0-21.local:14205] NO LT_DLADVISE - CANNOT LOAD LIBOMPI
JAVA BINDINGS FAILED TO LOAD REQUIRED LIBRARIES
[compute-0-21.local:14204] NO LT_DLADVISE - CANNOT LOAD LIBOMPI
JAVA BINDINGS FAILED TO LOAD REQUIRED LIBRARIES
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[48748,1],1]
  Exit code:1
--


-Bibrak


Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-29 Thread Jeff Squyres (jsquyres)
Let me try to understand this test: 

- you're simulating a 1GB memory limit via ulimit of virtual memory ("ulimit -v 
$((1*1024*1024))"), or 1,048,576 bytes.
- you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI app
- OMPI is barfing in the ptmalloc allocator

Meaning: you're trying to allocate 1,000x more memory than you're allowing in 
virtual memory -- so I guess part of this test depends on how much physical RAM 
you have, because you're limiting virtual memory, right?
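
If I've understood the pattern, the C equivalent would be roughly the
following (my own sketch, not your actual Fortran test or the Dalton code),
run under the same "ulimit -v":

  /* Rough C sketch of the allocation being tested (not the actual
     Fortran/Dalton code).  Run under: ulimit -v $((1*1024*1024)) */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      size_t nbytes = 1070UL * 1000UL * 1000UL;  /* 1,070,000,000 bytes */
      void *p = malloc(nbytes);

      if (p == NULL) {
          fprintf(stderr, "malloc(%zu) failed cleanly\n", nbytes);
          return 1;
      }
      printf("malloc(%zu) succeeded\n", nbytes);
      free(p);
      return 0;
  }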

It's quite possible that the ptmalloc included in OMPI doesn't guard well 
against a failed mmap.  FWIW, I've seen all kinds of random badness (not just 
with OMPI) when malloc/mmap/etc. start failing due to lack of memory.

Do you get the same behavior if you disable ptmalloc in OMPI?  (your IB large 
message bandwidth will suffer a bit, though)



On Aug 29, 2013, at 12:01 AM, Christopher Samuel  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 28/08/13 19:36, Chris Samuel wrote:
> 
>> With RHEL 6.4 gfortran it instead SEGV's straight away
> 
> Using strace I can see a mmap(2) (called from malloc I presume)
> failing just before the SEGV.
> 
> Process 6799 detached
> Process 6798 detached
> Hello, world, I am0  of1
> [pid  6796] mmap(NULL, 8560001024, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> [pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [barcoo:06796] *** Process received signal ***
> [barcoo:06796] Signal: Segmentation fault (11)
> [barcoo:06796] Signal code: Address not mapped (1)
> [barcoo:06796] Failing at address: 0x20078d708
> [pid  6796] mmap(NULL, 2097152, PROT_NONE, 
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f75a5fed000
> [barcoo:06796] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
> [barcoo:06796] [ 1] 
> /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982)
>  [0x7f77a68c2dd2]
> [barcoo:06796] [ 2] 
> /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) 
> [0x7f77a68c3f42]
> [barcoo:06796] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
> [barcoo:06796] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
> [barcoo:06796] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
> [barcoo:06796] [ 6] ./gnumyhello_f90() [0x400d69]
> [barcoo:06796] *** End of error message ***
> [pid  6796] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> [pid  6796] +++ killed by SIGSEGV (core dumped) +++
> 
> 
> The SEGV occurs (according to the gdb core dump I have) at the
> second set_head() call in this code:
> 
>  /* check that one of the above allocation paths succeeded */
>  if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
>remainder_size = size - nb;
>remainder = chunk_at_offset(p, nb);
>av->top = remainder;
>set_head(p, nb | PREV_INUSE | (av != &main_arena ? NON_MAIN_ARENA : 0));
>set_head(remainder, remainder_size | PREV_INUSE);
>check_malloced_chunk(av, p, nb);
>return chunk2mem(p);
>  }
> 
> 
> The arguments to that function are:
> 
> (gdb) print remainder
> $1 = (struct malloc_chunk *) 0x2008e5700
> 
> (gdb) print remainder_size
> $2 = 0
> 
> ANy ideas?
> 
> cheers,
> Chris
> - -- 
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> 
> iEYEARECAAYFAlIex30ACgkQO2KABBYQAh8HmQCgjj7tReOfdubczho7x9poprM7
> 5CwAnRBlw2LHrVHQsu2M1W6qo2H2HOzb
> =dasp
> -END PGP SIGNATURE-
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] NO LT_DLADVISE - CANNOT LOAD LIBOMPI JAVA BINDINGS

2013-08-29 Thread Ralph Castain
You also need to install a libltdl (libtool) that provides lt_dladvise support.
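
A quick way to check whether the libltdl you are building against actually
has the advise API is a small standalone test along these lines (my own
sketch, not anything from the OMPI tree; the file name is made up) -- it
will only compile and link against a new-enough libltdl:

  /* Standalone check (hypothetical check_ltdl.c, not from the OMPI tree):
     compiles and links only if libltdl provides lt_dladvise.
     Build with:  gcc check_ltdl.c -lltdl */
  #include <stdio.h>
  #include <ltdl.h>

  int main(void)
  {
      lt_dladvise advise;

      if (lt_dlinit() != 0) {
          fprintf(stderr, "lt_dlinit failed: %s\n", lt_dlerror());
          return 1;
      }
      if (lt_dladvise_init(&advise) != 0) {
          fprintf(stderr, "lt_dladvise_init failed: %s\n", lt_dlerror());
          lt_dlexit();
          return 1;
      }
      puts("libltdl provides lt_dladvise");
      lt_dladvise_destroy(&advise);
      lt_dlexit();
      return 0;
  }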

On Aug 29, 2013, at 6:18 AM, Bibrak Qamar  wrote:

> Hi all,
> 
> I have the following runtime error while running Java MPI jobs. I have check 
> the previous answers to the mailing list regarding this issue. 
> 
> The solutions were to install libtool and configure-compile-and-install 
> openmpi again this time with the latest version of
> 
> m4
> autoconfig
> automake
> libtools
> and flex
> 
> I did all that but again the same issue that it can't load the libraries. Any 
> remedies?
> 
> 
> 
> -bash-3.2$ mpirun -np 2 java Hello 
> [compute-0-21.local:14205] NO LT_DLADVISE - CANNOT LOAD LIBOMPI
> JAVA BINDINGS FAILED TO LOAD REQUIRED LIBRARIES
> [compute-0-21.local:14204] NO LT_DLADVISE - CANNOT LOAD LIBOMPI
> JAVA BINDINGS FAILED TO LOAD REQUIRED LIBRARIES
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[48748,1],1]
>   Exit code:1
> --
> 
> 
> -Bibrak
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-29 Thread Christopher Samuel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Jeff, Ralph,

On 29/08/13 23:30, Jeff Squyres (jsquyres) wrote:

> Let me try to understand this test:
> 
> - you're simulating a 1GB memory limit via ulimit of virtual
> memory ("ulimit -v $((1*1024*1024))"), or 1,048,576 bytes.

Yeah, basically doing by hand what Torque/Slurm do by default for jobs
(unless the user asks for more).

When this happens for Dalton (compiled with the Intel compilers) it
just sits there spinning its wheels at start up.

> - you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI 
> app

That was the developer trying to simulate the failure in Dalton.

> - OMPI is barfing in the ptmalloc allocator

Sounds like it.

> Meaning: you're trying to allocate 1,000x memory than you're
> allowing in virtual memory -- so I guess part of this test depends
> on how much physical RAM you have, because you're limiting virtual
> memory, right?

No, it only depends on the memory limits for the job in Slurm.

The reason for the test is that he was trying to see whether those limits
were successfully being propagated to the MPI ranks under Slurm (and it
appears not).
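
For what it's worth, a minimal MPI program along these lines -- my own
sketch, not the test he actually ran, using standard getrlimit(2) -- is
enough to show what address-space limit each rank ends up with:

  /* My own sketch (not the test he ran): print the address-space limit
     (what "ulimit -v" sets) as seen by each MPI rank.  Build with mpicc. */
  #include <stdio.h>
  #include <sys/resource.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      struct rlimit rl;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      getrlimit(RLIMIT_AS, &rl);
      printf("rank %d: RLIMIT_AS soft=%llu hard=%llu\n", rank,
             (unsigned long long) rl.rlim_cur,
             (unsigned long long) rl.rlim_max);
      MPI_Finalize();
      return 0;
  }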

However, in the process he found he could also replicate this
livelock/deadlock in Dalton.

> It's quite possible that the ptmalloc included in OMPI doesn't
> guard well against a failed mmap.  FWIW, I've seen all kinds of
> random badness (not just with OMPI) when malloc/mmap/etc. start
> failing due to lack of memory.

OK, so I'll try testing again with a larger limit to see if that will
ameliorate this issue.  I'm also wondering where this is happening in
OMPI; I've a sneaking suspicion this is at MPI_INIT().

> Do you get the same behavior if you disable ptmalloc in OMPI?
> (your IB large message bandwidth will suffer a bit, though)

Not tried that, but I'll take a look at it if it doesn't seem possible
to fix it with a change to the default memory limits (that'll be the
least intrusive).

Thanks!
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlIf2lMACgkQO2KABBYQAh/JrACfRKATdmD3hbSX0mHWtAt2cBP6
1wYAn31EjuS37inIaD151n1DxuAH4GAM
=yaYe
-END PGP SIGNATURE-


Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-29 Thread Jeff Squyres (jsquyres)
On Aug 29, 2013, at 7:33 PM, Christopher Samuel  wrote:

> OK, so I'll try testing again with a larger limit to see if that will
> ameliorate this issue.  I'm also wondering where this is happening in
> OMPI, I've a sneaking suspicion this is at MPI_INIT().


FWIW, the stack traces you sent are not during MPI_INIT.

What happens with OMPI's memory manager is that it inserts itself as *the* 
memory allocator for the entire process before main() even starts.  We have to 
do this because of the horribleness that is OpenFabrics/verbs and how it just 
doesn't match the MPI programming model at all.  :-(  (I think I wrote some 
blog entries about this a while ago...  Ah, here are a few:

http://blogs.cisco.com/performance/rdma-what-does-it-mean-to-mpi-applications/
http://blogs.cisco.com/performance/registered-memory-rma-rdma-and-mpi-implementations/

Or, more generally: http://blogs.cisco.com/tag/rdma/)

Therefore, (in C) if you call malloc() before MPI_Init(), it'll be calling 
OMPI's ptmalloc.  The stack traces you sent imply that the crash happens when 
your app calls the Fortran allocate -- which is after MPI_Init().
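
In C terms, the situation is roughly this (my sketch, not your code) -- both
allocations below already go through the interposed allocator, because the
interposition happens before main() even runs:

  /* My sketch (not the actual test code): both mallocs go through OMPI's
     interposed ptmalloc, because the interposition happens before main()
     runs -- MPI_Init() itself is not what switches the allocator. */
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      void *before = malloc(1024);          /* already OMPI's ptmalloc */

      MPI_Init(&argc, &argv);

      void *after = malloc(1070000000UL);   /* analogous to the Fortran
                                               allocate in the test */
      free(after);
      free(before);

      MPI_Finalize();
      return 0;
  }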

FWIW, you can build OMPI with --without-memory-manager, or you can setenv 
OMPI_MCA_memory_linux_disable to 1 (note: this is NOT a regular MCA parameter 
-- it *must* be set in the environment before the MPI app starts).  If this env 
variable is set, OMPI will *not* interpose its own memory manager in the 
pre-main hook.  That should be a quick/easy way to try with and without the 
memory manager and see what happens.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/