Bug#995599: libopenmpi3: segfault in mca_btl_vader.so on 32-bit arches

2021-10-13 Thread Jeff Squyres (jsquyres)
Thanks for the investigation and confirmation!

--
Jeff Squyres
jsquy...@cisco.com





Bug#995599: libopenmpi3: segfault in mca_btl_vader.so on 32-bit arches

2021-10-12 Thread Jeff Squyres (jsquyres)
I'm sorry, I just noticed that you replied 6 days ago, but I apparently wasn't 
notified by the Debian bug tracker.  :-(

Ok, so this is an MPI_Alltoall issue.  Does it use MPI_IN_PLACE?
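
For reference (an illustrative sketch only, not the dolfinx code -- whether dolfinx's all_to_all ends up on this path is exactly the question), the in-place form of MPI_Alltoall looks like this:

-
/* Illustrative sketch: a minimal in-place MPI_Alltoall over int64 data. */
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One int64 slot per peer, matching the std::int64_t case below. */
    int64_t *buf = malloc(nprocs * sizeof(int64_t));
    for (int i = 0; i < nprocs; ++i)
        buf[i] = rank;

    /* With MPI_IN_PLACE, sendcount/sendtype are ignored and the data is
       both sent from and received into buf. */
    MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                 buf, 1, MPI_INT64_T, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}
-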


On Wed, 06 Oct 2021 20:15:38 +0200 Drew Parsons  wrote:
> Source: openmpi
> Followup-For: Bug #995599
> 
> Not so simple to make a minimal test case I think.
> 
> all_to_all is defined in cpp/dolfinx/common/MPI.h in dolfinx source,
> and calls MPI_Alltoall from openmpi.
> 
> It's designed to use with graph::AdjacencyList from
> graph/AdjacencyList.h, and is called from
> compute_nonlocal_dual_graph() in mesh/graphbuild.cpp, where T is set
> to std::int64_t.
> 
> I tried grabbing dolfinx' all_to_all and using it with a pared-down
> version of AdjacencyList.  But it's not triggering the segfault on an
> i386 chroot. Possibly because I haven't populated it with an actual
> graph so there's nothing to send with MPI_Alltoall.
> 
> 



-- 
Jeff Squyres
jsquy...@cisco.com



Bug#995599:

2021-10-04 Thread Jeff Squyres (jsquyres)
Can you provide a small reproducer of the issue?



Bug#947148: Open MPI and libltdl

2020-01-28 Thread Jeff Squyres (jsquyres)
I just replied on the upstream Open MPI issue 
(https://github.com/open-mpi/ompi/issues/7331#issuecomment-579337035):

I'm not sure about that test: if you get a valid handle value back from 
lt_dlopen(), is the value from lt_dlerror() relevant?  I.e., is the "handle" 
value you're getting back from lt_dlopen() invalid?  All we can tell from this 
test is that it's not NULL.

Specifically: I'm not sure that calling lt_dlerror() will return anything 
meaningful if there has been no error.
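
To illustrate (a minimal sketch of the usual libltdl pattern, not the actual test in question; "some-module.la" is a placeholder name), lt_dlerror() is only meant to be consulted after a call has failed:

-
#include <ltdl.h>
#include <stdio.h>

int main(void)
{
    if (lt_dlinit() != 0) {
        fprintf(stderr, "lt_dlinit failed: %s\n", lt_dlerror());
        return 1;
    }

    lt_dlhandle h = lt_dlopen("some-module.la");
    if (h == NULL) {
        /* Only here is lt_dlerror() guaranteed to describe a real error. */
        fprintf(stderr, "lt_dlopen failed: %s\n", lt_dlerror());
    } else {
        /* A non-NULL handle means success; lt_dlerror() output at this
           point is not meaningful. */
        lt_dlclose(h);
    }

    lt_dlexit();
    return 0;
}
-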

-- 
Jeff Squyres
jsquy...@cisco.com



Bug#896861: More notes on Open MPI / sometimes "-l" issues

2018-09-07 Thread Jeff Squyres (jsquyres)
On Sep 7, 2018, at 1:29 AM, Alastair McKinstry  
wrote:
> 
>> I am using:
>> 
>> Autoconf 2.69
>> Automake 1.15 (*** you are using 1.15.1 -- do we think that makes a 
>> difference?)
>> Libtool 2.4.6
> 
> Something about Debian's libtool still implicated, then.

Ouch -- that could get tricky to track down.  :-(

Just for giggles: what version of (g)m4 do you have?  I'm using 1.4.17.

It would likely be difficult for me to try Debian's Libtool, but if you're 
using a different m4, that might impact Libtool's search-and-replace...?  I'd 
be happy to give it a whirl on my side (i.e., on RHEL and/or MacOS) with the 
full complement of exactly the (stock) Autotools versions you're using to see 
if that triggers the issue.

-- 
Jeff Squyres
jsquy...@cisco.com



Bug#896861: More notes on Open MPI / sometimes "-l" issues

2018-09-06 Thread Jeff Squyres (jsquyres)
On Sep 6, 2018, at 2:19 PM, Adrian Bunk  wrote:
> 
> With thanks to Santiago Vila we finally figured out how to reproduce it.
> 
> From a failed build log:
> I: NOTICE: Log filtering will replace 'build/openmpi-LjweZK/openmpi-3.0.1' 
> with '<>'
> I: NOTICE: Log filtering will replace 'build/openmpi-LjweZK' with 
> '<>'
> 
> Note the -L in the random string.

Oh wow, so your tmpdir just happens to contain "-L" and that causes the problem?

...unfortunately, I'm unable to replicate this issue.  :-\

Are you able to cause this problem running by hand?

Here's what I tried:

-
$ cd /home/jsquyres/openmpi-releases
$ mkdir -p build/openmpi-LjweZK
$ cd build/openmpi-LjweZK
$ tar xf ../../openmpi-3.0.1.tar.bz2
$ cd openmpi-3.0.1
$ ./configure --prefix=$HOME/bogus |& tee config.out
$ make -j 32 |& tee make.out
-

I also tried with a VPATH build (same general recipe as above, but in a vpath 
subdir)

Neither of these resulted in the problem.

What's different / why can't I reproduce?

I am using:

Autoconf 2.69
Automake 1.15 (*** you are using 1.15.1 -- do we think that makes a difference?)
Libtool 2.4.6

I am also using gcc 7.3.0.

Here's the directory I built in:

/home/jsquyres/openmpi-releases/build/openmpi-LjweZK/openmpi-3.0.1

I'm *not* running on Debian -- I'm running on an RHEL machine -- but this seems 
like a path issue, not a distro issue, so I'm kinda hoping that that doesn't 
matter...

Any thoughts?

-- 
Jeff Squyres
jsquy...@cisco.com



Bug#896861: More notes on Open MPI / sometimes "-l" issues

2018-04-27 Thread Jeff Squyres (jsquyres)
On Apr 27, 2018, at 2:02 PM, Alastair McKinstry  
wrote:
> 
>> If we're not able to get the CI build product, has anyone been able to 
>> reproduce the error manually?
> No one's caught it manually yet :-(

Computers are hard.

> Debian libtool has a bunch of patches applied
> (https://sources.debian.org/src/libtool/2.4.6-2.1/debian/patches/)
> but nothing has changed in libtool in years.

Ok.

Could it be a dependent library that is generating (somehow) a faulty .la file? 
 Then the Open MPI libtool would be reading that and somehow getting a blank 
library name...?  (That's a complete guess)

> Two notes: (1) We run the build system / make in parallel.

Should be ok.  We do parallel builds all the time.

> (2) When it
> fails, its always at this spot.
> Whats special about this point in the makefile?

It's Fortran...?

That's the only thing I can think of.  But honestly -- it's a fairly vanilla 
Automake-ized Makefile.am.


https://github.com/open-mpi/ompi/blob/master/ompi/mpi/fortran/use-mpi-ignore-tkr/Makefile.am

The only interesting thing in that Makefile.am is that we generate 2 files 
(mpi-ignore-tkr-sizeof.h and mpi-ignore-tkr-sizeof.f90), but those are clearly 
delineated as dependencies.

Maybe you can put a patch in that Makefile.am that causes libtool to be cat'ed 
(or even emailed to yourself -- hah!) so that we can potentially run it 
manually / trace it to see where this blank library name is coming from...?

-- 
Jeff Squyres
jsquy...@cisco.com



Bug#896861: More notes on Open MPI / sometimes "-l" issues

2018-04-27 Thread Jeff Squyres (jsquyres)
On Apr 27, 2018, at 12:57 PM, Alastair McKinstry  
wrote:
> 
> Before configure the following is run:
> 
> (cd config && autom4te --language=m4sh opal_get_version.m4sh -o
> opal_get_version.sh)
> ./autogen.pl --force

Ah -- you *are* running autogen and re-bootstrapping our tarballs with your own 
versions of the GNU Autotools.

What versions are you running of m4, Autoconf, Automake, and Libtool?

Open MPI v3.0.x is built with m4 1.4.17, Autoconf 2.69, Automake 1.15, and 
Libtool 2.4.6.

> Then (in this case):
> 
> ./configure --build=powerpc64-linux-gnu --prefix=/usr 
>   --includedir=\${prefix}/include --mandir=\${prefix}/share/man 
> --infodir=\${prefix}/share/info 
>   --sysconfdir=/etc --localstatedir=/var --disable-silent-rules 
> --libdir=\${prefix}/lib/powerpc64-linux-gnu 
>   --libexecdir=\${prefix}/lib/powerpc64-linux-gnu --runstatedir=/run 
> --disable-maintainer-mode 
>   --disable-dependency-tracking --with-verbs 
> --with-pmix=/usr/lib/powerpc64-linux-gnu/pmix 
>  --with-jdk-dir=/usr/lib/jvm/default-java --enable-mpi-java 
> --enable-opal-btl-usnic-unit-tests --disable-wrapper-rpath 
>  --with-libevent=external --enable-mpi-thread-multiple --disable-silent-rules 
> --enable-mpi-cxx 
>   --with-hwloc=/usr/ --with-libltdl=/usr/ --with-devel-headers --with-slurm 
> --with-sge --without-tm
>  --disable-vt --sysconfdir=/etc/openmpi 
> --libdir=\${prefix}/lib/powerpc64-linux-gnu/openmpi/lib 
>   --includedir=\${prefix}/lib/powerpc64-linux-gnu/openmpi/include
> configure: WARNING: unrecognized options: --disable-maintainer-mode, 
> --enable-mpi-thread-multiple, --disable-vt
> checking for perl... perl

This all generally looks fine.

If it really is libtool that is inserting this errant -l without a following 
library name, I think we need to know what version of libtool (plus any 
Debian-specific patches?) you are using to bootstrap Open MPI's build process, 
and see/trace the libtool that is being used to see where that errant -l is 
coming from.

If we're not able to get the CI build product, has anyone been able to 
reproduce the error manually?

-- 
Jeff Squyres
jsquy...@cisco.com



Bug#896861: More notes on Open MPI / sometimes "-l" issues

2018-04-27 Thread Jeff Squyres (jsquyres)
Thanks to Alastair for filing https://github.com/open-mpi/ompi/issues/5114 to 
inform us upstream of this issue.

When this problem occurs, can you tell:

1. Is the errant "-l" being added by libtool (i.e., at "make" time)?
2. Or is the errant "-l" being added by configure (i.e., at "./configure" time)?

Looking at the very bottom of one of the logs on this ticket 
(https://buildd.debian.org/status/fetch.php?pkg=openmpi=ppc64=3.0.1-9=1524654706=0)
 -- extra blank lines inserted below just for clarity:

-
make[3]: Entering directory 
'/<>/ompi/mpi/fortran/use-mpi-ignore-tkr'

/bin/bash ../../../../libtool  --tag=FC   --mode=link gfortran 
-I../../../../ompi/include -I../../../../ompi/include -I../../../.. 
-I../../../..  -g -O2 -fdebug-prefix-map=/<>=. 
-fstack-protector-strong -O3 -version-info 40:0:0  -Wl,-z,relro   -L/usr//lib  
-o libmpi_usempi_ignore_tkr.la -rpath /usr/lib/powerpc64-linux-gnu/openmpi/lib 
mpi-ignore-tkr.lo mpi-ignore-tkr-sizeof.lo /<>/opal/libopen-pal.la 
-lrt -lm -lutil   -lhwloc  -levent -levent_pthreads

libtool: link: gfortran -shared  -fPIC  .libs/mpi-ignore-tkr.o 
.libs/mpi-ignore-tkr-sizeof.o   -Wl,-rpath -Wl,/<>/opal/.libs 
-Wl,-rpath -Wl,/usr/lib/powerpc64-linux-gnu/openmpi/lib -L/usr//lib 
/<>/opal/.libs/libopen-pal.so -lrt -lutil -lhwloc -levent 
-levent_pthreads -l -L/usr/lib/gcc/powerpc64-linux-gnu/7 
-L/usr/lib/gcc/powerpc64-linux-gnu/7/../../../powerpc64-linux-gnu 
-L/usr/lib/gcc/powerpc64-linux-gnu/7/../../../../lib -L/lib/powerpc64-linux-gnu 
-L/lib/../lib -L/usr/lib/powerpc64-linux-gnu -L/usr/lib/../lib 
-L/usr/lib/gcc/powerpc64-linux-gnu/7/../../.. -lgfortran -lm -lc -lgcc_s  -g 
-O2 -fstack-protector-strong -O3 -Wl,-z -Wl,relro   -pthread -Wl,-soname 
-Wl,libmpi_usempi_ignore_tkr.so.40 -o .libs/libmpi_usempi_ignore_tkr.so.40.0.0

/usr/bin/powerpc64-linux-gnu-ld: cannot find 
-l-L/usr/lib/gcc/powerpc64-linux-gnu/7

collect2: error: ld returned 1 exit status

make[3]: *** [Makefile:1866: libmpi_usempi_ignore_tkr.la] Error 1


I see:

/usr/bin/powerpc64-linux-gnu-ld: cannot find 
-l-L/usr/lib/gcc/powerpc64-linux-gnu/7

But the real problem looks like "-l" was added with either no library name, or 
a *blank* library name following it.  In the middle of the long "libtool:" line:


... -levent_pthreads -l -L/usr/lib/gcc/powerpc64-linux-gnu/7 ...


Note the "-l" hanging out there by itself.

I also note the "-levent_pthreads" immediately preceding it, which is the last 
argument from the Open MPI-issued compile line.

That means that libtool itself is adding the -l without a following library 
name (or erroneously blank following library name).

It might be worth checking the .la files that libtool is examining -- perhaps 
from elsewhere on the system / outside the source/build trees -- here to see if 
there are any unexpectedly-empty library names.  If those .la files got built 
incorrectly somehow, that could lead to link errors later like this...?
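
For anyone doing that check: a .la file is plain text, and the line to eyeball is dependency_libs, since libtool copies its contents onto later link lines. A made-up example of a healthy one (libexample is not a real Debian library; a truncated entry such as a bare "-l" in that list would be exactly the kind of thing to look for):

-
# libexample.la - a libtool library file (illustrative only)
dlname='libexample.so.1'
library_names='libexample.so.1.0.0 libexample.so.1 libexample.so'
dependency_libs=' -L/usr/lib -levent -levent_pthreads '
-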

Is it possible to check the build product from your automated CI like this?

This is using the Open MPI-bootstrap-provided libtool, right (i.e., from the 
Open MPI 3.0.x tarball)?  I.e., you didn't invoke "autogen.pl" again to 
re-bootstrap the Open MPI build system, right?

-- 
Jeff Squyres
jsquy...@cisco.com



Bug#720419: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64

2013-10-02 Thread Jeff Squyres (jsquyres)
On Sep 30, 2013, at 11:05 AM, Sylvestre Ledru sylves...@debian.org wrote:

 Here are the options list:
 configure: running /bin/bash './configure'  CFLAGS=-DNDEBUG -g -O2
 -Wformat -Werror=format-security -finline-functions -fno-strict-aliasing
 -pthread CPPFLAGS= -I/usr//include   -I/usr/include/infiniband
 -I/usr/include/infiniband FFLAGS=-g -O2 LDFLAGS=  -L/usr//lib
 --enable-shared --disable-static  --prefix=/usr --with-mpi=open_mpi
 --disable-aio --cache-file=/dev/null --srcdir=. --disable-option-checking

Hmm -- I'm confused here; it's not possible that you're getting an assertion 
failure with this configure line, for two reasons:

1. The assert() in question will only be compiled in if you --enable-debug on 
the configure command line.
2. You supplied -DNDEBUG in CFLAGS, which means you've disabled all assert()s

Can you verify that this is the correct configure line that you used to 
generate that error?  Or is something else going on?
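
(For reference, a tiny illustration of point 2: compiled with -DNDEBUG, an assert() like this becomes a no-op and can never fire.)

-
#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* With -DNDEBUG on the compile line, <assert.h> defines assert() as
       a no-op, so this cannot fire; without it, the program aborts here. */
    assert(0 && "would abort in a debug build");
    printf("assert was compiled out\n");
    return 0;
}
-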

 2. try the 1.7 branch using that same configuration
 
 The 1.6 series is reaching its planned end-of-life, so we are trying to 
 decide how important it is to chase this down - i.e., if you see the same 
 problem on Debian with 1.7, then this becomes far more important.
 Sure, I will do that asap.

Thanks.

 Do you have an eta for the 1.8 ? (if I remember correctly, 1.7 is a
 development release).

1.7 is a feature release.  OMPI 1.odd.x series are stable and tested; they're 
just not as time-tested out in the real world as OMPI 1.even.x series.

We're anticipating 1.8 will be out in early 2014.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/





Bug#592326: [Pkg-openmpi-maintainers] Bug#592326: Failure of AZTEC test case run.

2010-09-03 Thread Jeff Squyres (jsquyres)
Adding pthread could fix something, but I'm a little dubious. It seems 
unlikely. 

You should probably contact the Aztec authors at this point. 

Sent from my PDA. No type good. 

On Sep 3, 2010, at 3:05 AM, Rachel Gordon rgor...@techunix.technion.ac.il 
wrote:

 Dear Jeff, Ralf and  Manuel
 
 There are some good news,
 I added -pthread  to both the compilation and link for running
 az_tutorial_with_MPI.f, and I also compiled aztec with -pthread
 Now the code runs O.K for np=1,2.
 
 Now bad news: when I try running with 3,4 or more processors I get a similar 
 error message:
 
 mpirun -np 3 sample
 
 [cluster:25805] *** Process received signal ***
 [cluster:25805] Signal: Segmentation fault (11)
 [cluster:25805] Signal code:  (128)
 [cluster:25805] Failing at address: (nil)
 [cluster:25805] [ 0] /lib/libpthread.so.0 [0x7fbe20cb5a80]
 [cluster:25805] [ 1] /shared/lib/libmpi.so.0 [0x7fbe221325f7]
 [cluster:25805] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7fbe22160a48]
 [cluster:25805] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
 [cluster:25805] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
 [cluster:25805] [ 5] sample(AZ_transform+0x1c3) [0x418372]
 [cluster:25805] [ 6] sample(az_transform_+0x84) [0x407943]
 [cluster:25805] [ 7] sample(MAIN__+0x19a) [0x407708]
 [cluster:25805] [ 8] sample(main+0x2c) [0x44e00c]
 [cluster:25805] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fbe209721a6]
 [cluster:25805] [10] sample [0x4073b9]
 [cluster:25805] *** End of error message ***
 --
 mpirun noticed that process rank 1 with PID 25805 on node cluster exited on 
 signal 11 (Segmentation fault).
 --
 
 When I try running on 4 processors I get a double message (from 2 processors).
mpirun -np 4 sample
 
 [cluster:25946] *** Process received signal ***
 [cluster:25946] Signal: Segmentation fault (11)
 [cluster:25946] Signal code:  (128)
 [cluster:25946] Failing at address: (nil)
 [cluster:25947] *** Process received signal ***
 [cluster:25947] Signal: Segmentation fault (11)
 [cluster:25947] Signal code:  (128)
 [cluster:25947] Failing at address: (nil)
 [cluster:25946] [ 0] /lib/libpthread.so.0 [0x7f4ae4c6ba80]
 [cluster:25946] [ 1] /shared/lib/libmpi.so.0 [0x7f4ae60e85f7]
 [cluster:25946] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7f4ae6116a48]
 [cluster:25946] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
 [cluster:25946] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
 [cluster:25947] [ 0] /lib/libpthread.so.0 [0x7f7dc5350a80]
 [cluster:25946] [ 5] sample(AZ_transform+0x1c3) [0x418372]
 [cluster:25946] [ 6] sample(az_transform_+0x84) [0x407943]
 [cluster:25946] [ 7] sample(MAIN__+0x19a) [0x407708]
 [cluster:25946] [ 8] sample(main+0x2c) [0x44e00c]
 [cluster:25946] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f4ae49281a6]
 [cluster:25946] [10] sample [0x4073b9]
 [cluster:25946] *** End of error message ***
 [cluster:25947] [ 1] /shared/lib/libmpi.so.0 [0x7f7dc67cd5f7]
 [cluster:25947] [ 2] /shared/lib/libmpi.so.0(PMPI_Wait+0x38) [0x7f7dc67fba48]
 [cluster:25947] [ 3] sample(md_wrap_wait+0x17) [0x41ccba]
 [cluster:25947] [ 4] sample(AZ_find_procs_for_externs+0x5bf) [0x4177e7]
 [cluster:25947] [ 5] sample(AZ_transform+0x1c3) [0x418372]
 [cluster:25947] [ 6] sample(az_transform_+0x84) [0x407943]
 [cluster:25947] [ 7] sample(MAIN__+0x19a) [0x407708]
 [cluster:25947] [ 8] sample(main+0x2c) [0x44e00c]
 [cluster:25947] [ 9] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f7dc500d1a6]
 [cluster:25947] [10] sample [0x4073b9]
 [cluster:25947] *** End of error message ***
 --
 mpirun noticed that process rank 1 with PID 25946 on node cluster exited on 
 signal 11 (Segmentation fault).
 --
 
 
 
 
 Attached is the file found in AZTEC named:  md_wrap_mpi_c.c
 This might give you some further hint.
 
 
 
 Rachel
 
  Dr.  Rachel Gordon
  Senior Research Fellow   Phone: +972-4-8293811
  Dept. of Aerospace Eng.Fax:   +972 - 4 - 8292030
  The Technion, Haifa 32000, Israel email: rgor...@tx.technion.ac.il
 
 
 On Thu, 2 Sep 2010, Ralf Wildenhues wrote:
 
 Hello Rachel, Jeff,
 
 * Rachel Gordon wrote on Thu, Sep 02, 2010 at 01:35:37PM CEST:
 The cluster I am trying to run on has only the openmpi MPI version.
 So, mpif77 is equivalent to mpif77.openmpi and mpicc is equivalent
 to mpicc.openmpi
 
 I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
 The compilation and linkage stage ran with no problem:
 
 mpif77 -O   -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=20
 -DMAX_CHUNK_SIZE=20  -c -o az_tutorial_with_MPI.o
 az_tutorial_with_MPI.f
 mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec  -o sample
 
 Can you retry but this time add -pthread to both compile and link
 command?
 

Bug#592326: [Pkg-openmpi-maintainers] Bug#592326: Failure of AZTEC test case run.

2010-09-02 Thread Jeff Squyres (jsquyres)
If you're segv'ing in comm size, this usually means you are using the wrong 
mpi.h.  Ensure you are using ompi's mpi.h so that you get the right values for 
all the MPI constants. 

Sent from my PDA. No type good. 

On Sep 2, 2010, at 7:35 AM, Rachel Gordon rgor...@techunix.technion.ac.il 
wrote:

 Dear Manuel,
 
 Sorry, it didn't help.
 
 The cluster I am trying to run on has only the openmpi MPI version. So, 
 mpif77 is equivalent to mpif77.openmpi and mpicc is equivalent to 
 mpicc.openmpi
 
 I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
 The compilation and linkage stage ran with no problem:
 
 
 mpif77 -O   -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=20 
 -DMAX_CHUNK_SIZE=20  -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
 mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec  -o sample
 
 
 But again when I try to run 'sample' I get:
 
 mpirun -np 1 sample
 
 
 [cluster:24989] *** Process received signal ***
 [cluster:24989] Signal: Segmentation fault (11)
 [cluster:24989] Signal code: Address not mapped (1)
 [cluster:24989] Failing at address: 0x10098
 [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
 [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
 [0x7f50594ce34e]
 [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
 [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
 [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
 [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
 [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
 [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
 [cluster:24989] [ 8] sample [0x407459]
 [cluster:24989] *** End of error message ***
 --
 mpirun noticed that process rank 0 with PID 24989 on node cluster exited on 
 signal 11 (Segmentation fault).
 --
 
 Thanks for your help and cooperation,
 Sincerely,
 Rachel
 
 
 
 On Wed, 1 Sep 2010, Manuel Prinz wrote:
 
 Hi Rachel,
 
 I'm not very familiar with Fortran, so I'm most likely not of too much
 help here. I added Jeff to CC, maybe he can shed some light on this.
 
 Am Montag, den 09.08.2010, 12:59 +0300 schrieb Rachel Gordon:
 package:  openmpi
 
 dpkg --search openmpi
 gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
 gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
 gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
 gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
 gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
 gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
 gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
 gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
 gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
 gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
 gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
 gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
 gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
 gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
 gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
 gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
 gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
 gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
 gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
 gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
 gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
 gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
 gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
 gromacs-openmpi: /usr/share/doc/gromacs-openmpi
 gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
 gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
 gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
 gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
 gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a
 
 
 Dear support,
 I am trying to run a test case of the AZTEC library named
 az_tutorial_with_MPI.f . The example uses gfortran + MPI. The
 compilation and linkage stage goes O.K., generating an executable
 'sample'. But when I try to run sample (on 1 or more
 processors) the run crashes immediately.
 
 The compilation and linkage stage is done as follows:
 
 gfortran -O  -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx
 -I../lib -DMAX_MEM_SIZE=16731136
 -DCOMM_BUFF_SIZE=20 -DMAX_CHUNK_SIZE=20  -c -o
 az_tutorial_with_MPI.o az_tutorial_with_MPI.f
 gfortran az_tutorial_with_MPI.o -O -L../lib -laztec  -lm -L/shared/lib
 -lgfortran -lmpi -lmpi_f77 -o sample
 
 Generally, when compiling programs for use with MPI, you should use the
 compiler wrappers which do all the magic. In Debian's case this is
 mpif77.openmpi and mpif90.openmpi, respectively. Could you give that a
 try?
 
 The run:
 /shared/home/gordon/Aztec_lib.dir/appmpirun -np 1 sample
 
 [cluster:12046] *** Process received signal ***
 [cluster:12046] Signal: Segmentation fault (11)
 [cluster:12046] Signal code: Address not mapped (1)
 [cluster:12046] 

Bug#531522: mpicc segfaults when called by fakeroot

2009-06-08 Thread Jeff Squyres (jsquyres)
I'm afk atm and can't see the full bug so far - yes, we should definitely not 
be calling any flavor of alloc in the malloc init hook (e.g. via stat). 

Is there a run-time way to tell if we're running under fakeroot?  We can 
certainly just disable the malloc init hook if it detects that it's under 
fakeroot. 
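
(One possible heuristic -- an assumption on my part, not something Open MPI does today, and probably_under_fakeroot is just a hypothetical helper -- is to look for the FAKEROOTKEY environment variable that fakeroot exports to the processes it wraps:)

-
#include <stdio.h>
#include <stdlib.h>

/* Heuristic guess: fakeroot sets FAKEROOTKEY in the environment of the
   processes it wraps, so its presence suggests we are under fakeroot.
   This is a hint, not a guarantee. */
static int probably_under_fakeroot(void)
{
    return getenv("FAKEROOTKEY") != NULL;
}

int main(void)
{
    printf("under fakeroot: %s\n", probably_under_fakeroot() ? "yes" : "no");
    return 0;
}
-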

Also, we use our own libltdl because we require advanced features that are not 
always available in the system-installed ltdl (e.g. the installed version is too 
old). There should be no conflict between our ltdl and the system-installed one. 
Disabling it can be good at scale (e.g. dozens of machines in a single MPI job - 
see the FAQ) and for debugging. 

I'll be back in range in several hours (boarding a flight right now). 

-jms
Sent from my PDA.  No type good.

- Original Message -
From: Manuel Prinz deb...@pinguinkiste.de
To: Steve M. Robbins st...@sumost.ca
Cc: Jeff Squyres (jsquyres); 531...@bugs.debian.org 531...@bugs.debian.org; 
531...@bugs.debian.org 531...@bugs.debian.org; faker...@packages.debian.org 
faker...@packages.debian.org; li...@packages.debian.org 
li...@packages.debian.org
Sent: Sun Jun 07 17:51:14 2009
Subject: Re: mpicc segfaults when called by fakeroot

Hi Jeff and Steve,

thanks a lot for diving into it! It's very appreciated! (I was not able
to access a computer during the last two days, so sorry for being
unresponsive!)

Am Sonntag, den 07.06.2009, 11:04 -0500 schrieb Steve M. Robbins:
 I was able to avoid the segfault simply by ifdef'ing out this section
 (patch attached).  This should suffice in the short term for Debian on
 the theory that OpenMPI compatibility with fakeroot is more important
 than OpenMPI compatibility with OpenFabrics.

This is very hard to decide. Of course, we need Open MPI to work with
fakeroot, since our build system relies on that. There's no way around
that. As for OpenFabrics, probably most users will use MPI over fast
interconnects, so we really do need InfiniBand support as well. With the
transition in mind, I would consider disabling InfiniBand as a
short-term and temporary option.

Nevertheless, I will do some more tests tomorrow, hoping to find a less
drastic solution. Jeff's suggestion to disable libltdl sounds like a
reasonable thing. As it seems, we should probably disable it anyway
since Open MPI brings its own copy and does not allow building against
a version already installed on the system. Jeff, can you confirm that?

(Currently, the versions of libltdl of Open MPI and Debian seem to
differ. Though it might not be the reason, it might mean some extra work
for the release and/or security team.)

 However, there is clearly a bad interaction between this code, eglibc,
 and fakeroot.  Hence the cc's to the various packages.

Thanks for putting them in the loop! I already sent a mail to the libc
maintainers a view days ago but did not test with a downgraded libc.

 I'm speculating that memory allocation while in the
 __malloc_initialize_hook is a bad thing.  Perhaps the stat() in
 fakeroot caused a memory allocation, whereas the regular stat() does
 not, as this code doesn't segfault in normal use.

This is what I had in mind as well.
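
(For readers following along, the hook being discussed is the old glibc __malloc_initialize_hook. A minimal sketch of that historical pattern -- note the hook has long been deprecated and is removed from modern glibc, so this is for illustration only:)

-
#include <malloc.h>

/* Historical glibc interface: a function assigned to
   __malloc_initialize_hook runs right before the first malloc().
   Anything it does that allocates -- directly, or indirectly via an
   intercepted call such as fakeroot's stat() wrapper -- re-enters
   malloc before it is initialized. */
static void my_init_hook(void)
{
    /* Must not allocate memory here. */
}

void (*__malloc_initialize_hook)(void) = my_init_hook;
-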

Thanks for your work so far! I'm quite confident that we can sort it out
soon! :)

Best regards
Manuel