OK, a couple of things. I still think this is the same issue that I observed 1-2 days ago. Could you try removing the fs/lustre component from your build, e.g. by adding an .ompi_ignore file to that directory, and see whether that fixes the issue?
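For reference, a minimal sketch of what I mean, assuming a source tree laid out like the 1.7 branch (components under ompi/mca/<framework>/<component>). The helper name ignore_component is made up for illustration; the only thing taken from the build system is the .ompi_ignore marker file itself. As far as I know the marker is only honored when autogen is re-run, so autogen, configure and make would have to be repeated afterwards.

```shell
# Hypothetical helper: drop an .ompi_ignore marker into one MCA component
# directory so that the component is skipped the next time autogen runs.
ignore_component() {
    # $1 = top of the source tree, $2 = framework, $3 = component
    dir="$1/ompi/mca/$2/$3"
    if [ ! -d "$dir" ]; then
        echo "no such component directory: $dir" >&2
        return 1
    fi
    touch "$dir/.ompi_ignore"
    echo "will be skipped at next autogen: $dir"
}

# Usage (the path is a placeholder, adjust it to your checkout):
#   ignore_component /path/to/openmpi-1.7 fs lustre
```

Removing the marker file and re-running autogen should bring the component back, which makes this easy to undo once we know whether fs/lustre is the culprit.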
On my machine (no lustre, no ib) I tried builds with --disable-mpi-io *or* --disable-io-romio; both compiled correctly and I could run things. Note that the two flags are by now truly different: the second flag is equivalent to --enable-mca-no-build=io:romio, while the first flag disables the io, fcoll, fs and sharedfp frameworks. (Prior to ompio they had basically the same effect.) In your particular case this means that you disabled romio, but the entire ompio stack is still compiled, and the error must come from that portion.

If my suspicion is correct, it is still liblustre messing around with the malloc hooks, which causes the stack frame to be completely broken. I thought I had fixed that, since we did not have the issue on trunk, but we observed it in the 1.7 branch 1-2 days ago as well, and I was looking into it.

That being said, there is another malloc-hooks issue that makes me a bit nervous. Compiling the otf code produced a ton of warnings on my machine with gcc 4.6.2, also with respect to _malloc_hooks and _realloc_hooks. I am not sure whether this contributed to the problem as well; I just thought I would bring it up, since we seem to have a corrupted stack frame problem.

Thanks
Edgar

On 10/30/2012 8:29 AM, Edgar Gabriel wrote:
> ok, I'll look into this. I noticed a problem with static builds on
> lustre file systems recently, and I was wondering whether it's the same
> issue or not. But I'll check what's going on.
>
> Thanks
> Edgar
>
> On 10/30/2012 7:22 AM, Ralph Castain wrote:
>> No to Lustre, and I didn't build static
>>
>> I'm not sure what, if any, parallel file system might be present. In the
>> case that works, I just built with no configure args other than prefix.
>> ompi_info shows both romio and mpio built, but nothing more about what
>> support they built internally.
>>
>> On Oct 30, 2012, at 4:14 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>>
>>> Ralph,
>>>
>>> just out of curiosity: is there a lustre file system on the machine and is
>>> this a static build?
>>>
>>> Thanks
>>> Edgar
>>>
>>> On 10/29/2012 9:17 PM, Ralph Castain wrote:
>>>> Hmmm...I added that directory and tried this on odin (which is an IB-based
>>>> machine). Any MPI proc segfaults:
>>>>
>>>> Core was generated by `./hello'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>> 574 src/inode.c: No such file or directory.
>>>> in src/inode.c
>>>> (gdb) where
>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>> #1 0x00002aaaabd3f3e9 in _sysio_path_walk (parent=0x0, nd=0x7fffffffd8e0) at src/namei.c:216
>>>> #2 0x00002aaaabd3faad in _sysio_namei (parent=0x0, path=<value optimized out>, flags=0, intnt=0x7fffffffd950, pnop=0x7fffffffd970) at src/namei.c:505
>>>> #3 0x00002aaaabd3fd98 in open (path=0x2aaaac24280f "/sys/devices/system/node", flags=<value optimized out>) at src/open.c:179
>>>> #4 0x00002aaaabd43d5b in opendir (name=0x2aaaac24280f "/sys/devices/system/node") at src/stddir.c:60
>>>> #5 0x00002aaaac241825 in numa_max_node () from /usr/lib64/libnuma.so.1
>>>> #6 0x00002aaaac241d13 in numa_init () from /usr/lib64/libnuma.so.1
>>>> #7 0x00002aaaaaab845b in call_init () from /lib64/ld-linux-x86-64.so.2
>>>> #8 0x00002aaaaaab8565 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>>>> #9 0x00002aaaaaaabaaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>>>> #10 0x0000000000000001 in ?? ()
>>>> #11 0x00007fffffffe03c in ?? ()
>>>> #12 0x0000000000000000 in ?? ()
>>>>
>>>> I got the same thing whether I excluded openib or not. I then ran on my
>>>> Linux cluster, which doesn't have IB at all - and it ran fine. Also runs
>>>> clean on the Mac.
>>>> However, in both those cases, I had left IO romio enabled.
>>>>
>>>> Now on odin, I always disable-io-romio. So I tried deliberately enabling
>>>> it, and everything works. So this appears to be something that the IO work
>>>> has broken.
>>>>
>>>> Edgar: can you please fix --disable-io-romio?
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Oct 29, 2012, at 11:55 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>>>>
>>>>> I'm sorry to add one more thing to the list, but beyond this file, it
>>>>> looks like the entire ompi/mca/common/verbs/ directory is also
>>>>> missing in the 1.7 branch, but is required to compile the bcol
>>>>> framework. It is there in the trunk, but missing in the 1.7 branch...
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>>
>>>>> On 10/26/2012 5:31 PM, Ralph Castain wrote:
>>>>>> Okay, I'll fix for tonight's tarball.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Oct 26, 2012, at 3:28 PM, "Shamis, Pavel" <sham...@ornl.gov> wrote:
>>>>>>
>>>>>>> There is a bug in the makefile. The file exists in svn, but it is not
>>>>>>> listed in the Makefile.am. As a result, it wasn't pulled into the tarball.
>>>>>>>
>>>>>>> Pavel (Pasha) Shamis
>>>>>>> ---
>>>>>>> Computer Science Research Group
>>>>>>> Computer Science and Math Division
>>>>>>> Oak Ridge National Laboratory
>>>>>>>
>>>>>>> On Oct 26, 2012, at 2:33 PM, Edgar Gabriel wrote:
>>>>>>>
>>>>>>> we have trouble compiling the 1.7 series on a machine in Dresden.
>>>>>>> Specifically, we receive an error message when compiling the
>>>>>>> bcol/iboffload component (other infiniband components compile fine).
>>>>>>>
>>>>>>> Any idea/suggestions what we might be doing wrong or what to look for?
>>>>>>>
>>>>>>> make[2]: Entering directory `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>> CC bcol_iboffload_module.lo
>>>>>>> CC bcol_iboffload_mca.lo
>>>>>>> CC bcol_iboffload_endpoint.lo
>>>>>>> CC bcol_iboffload_frag.lo
>>>>>>> In file included from bcol_iboffload_frag.c:16:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_frag.lo] Error 1
>>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>>> In file included from bcol_iboffload_mca.c:18:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_mca.lo] Error 1
>>>>>>> In file included from bcol_iboffload_endpoint.c:23:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_endpoint.lo] Error 1
>>>>>>> In file included from bcol_iboffload_module.c:39:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_module.lo] Error 1
>>>>>>> make[2]: Leaving directory `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>> make[1]: Leaving directory `/home/h2/gabriel/openmpi-1.7rc4/ompi'
>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>
>>>>>>> Thanks
>>>>>>> Edgar
>>>>>>>
>>>>>>> --
>>>>>>> Edgar Gabriel
>>>>>>> Associate Professor
>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>> Department of Computer Science University of Houston
>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335