You should be able to grab an Open MPI 1.7.x nightly tarball, and it should have the newer hwloc that fixes this issue.
Can you give it a whirl and see if it works for you?

On Nov 4, 2013, at 1:49 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Thanks. That's indeed the same bug that you hit in Open MPI (reuse of a
> hwloc cpuset structure that was freed earlier). It's a nasty bug that
> happens when reloading from XML on big machines like yours (that
> explains why lstopo works while xmlbuffer and OMPI fail). It was fixed
> in hwloc v1.7.1 (hence it will be fixed in Open MPI 1.7.4, from what I
> understand), but the fix was too big to be backported to older hwloc/OMPI.
>
> You should be able to work around the problem for now by setting
> HWLOC_GROUPING=0 in your environment.
>
> I re-added hwloc-users to CC so that the bug is officially "closed".
>
> Brice
>
>
> On 04/11/2013 22:33, Paul Kapinos wrote:
>> Hello again,
>> I'm not allowed to post to the Hardware Locality users list, so I've left
>> it off the CC for now.
>>
>> On 11/04/13 14:19, Brice Goglin wrote:
>>> On 04/11/2013 11:44, Paul Kapinos wrote:
>>>> Hello all,
>>>> I.
>>>> Sorry for this paleontological excursion. (The 4-year-old 'lstopo'
>>>> binary was still sitting in my private bin folder and still ran...)
>>>>
>>>> Attached is the output of the newer version 1.5 (the Linux default on
>>>> RHEL/6.4 (SL/6.4)).
>>>>
>>>> II.
>>>> I've also tested hwloc-1.5.2 (could not find v1.5.3) and hwloc-1.7.2
>>>> as Brice suggested, by 'configure' + 'make test' - logs attached.
>>>>
>>>> 1.5.2 fails:
>>>>> /bin/sh: line 5: 20677 Segmentation fault (core dumped) ${dir}$tst
>>>>> FAIL: xmlbuffer
>>>
>>> Can you give more details about this segfault?
>>>
>>> Try (from the build tree):
>>> $ libtool --mode=execute gdb xmlbuffer
>>> then type 'run'
>>> when it crashes, type 'bt full' and send the output.
>>
>> See the attached file trace_1.5.2.txt.
>>
>>>
>>> Then please also run from hwloc 1.5.2:
>>> * "lstopo foo.xml" and send "foo.xml"
>>> * "hwloc-gather-topology foo" and send "foo.tar.bz2"
>>
>> Also attached, but with non-empty names :o)
>>
>> Best
>>
>> Paul
>>>
>>>> whereas 1.7.2 seems to be OK.
>>>>
>>>> AFAIK the bundled 'hwloc' version is to be updated in Open MPI 1.7.4?
>>>> If so, the original issue should be fixed by this, huh?
>>>
>>> Hard to say before we get details about the crash in xmlbuffer above.
>>>
>>> Brice
>>>
>>>>
>>>> Many thanks for your help!
>>>> Best
>>>>
>>>> Paul
>>>>
>>>> pk224850@linuxitvc00:~/SVN/mpifasttest/trunk[511]lstopo 1.5
>>>> $ lstopo lstopo_linuxitvc00_1.5.txt
>>>> $ lstopo lstopo_linuxitvc00_1.5.xml
>>>>
>>>> On 11/01/13 15:37, Brice Goglin wrote:
>>>>> Sorry, I missed the mail on OMPI-users.
>>>>>
>>>>> This hwloc looks veeeeeeeeeeeery old. We haven't had Misc objects instead
>>>>> of Groups since we switched from 0.9 to 1.0. You should regenerate the XML
>>>>> file with a hwloc version that came out after the big bang (or better,
>>>>> after the asteroid killed the dinosaurs). Please resend that XML from a
>>>>> recent hwloc so that we can get a better clue about the problem.
>>>>>
>>>>> Assuming there's a bug in OMPI's hwloc, I would suggest downloading
>>>>> hwloc 1.5.3 and running make check on that machine. And try again with
>>>>> hwloc 1.7.2 in case that's already fixed.
>>>>>
>>>>> thanks
>>>>> Brice
>>>>>
>>>>> On 01/11/2013 15:24, Jeff Squyres (jsquyres) wrote:
>>>>>> Paul Kapinos originally reported this issue on the OMPI users list.
>>>>>>
>>>>>> He is showing a stack trace from OMPI-1.7.3, which uses hwloc 1.5.2
>>>>>> (note that OMPI 1.7.4 will use hwloc 1.7.2).
>>>>>>
>>>>>> I tried to read the XML file he provided with the git hwloc master
>>>>>> HEAD, and it fails:
>>>>>>
>>>>>> -----
>>>>>> ❯❯❯ ./utils/lstopo -i lstopo_linuxitvc00.xml
>>>>>> ignoring depth attribute for object type without depth
>>>>>> ignoring depth attribute for object type without depth
>>>>>> XML component discovery failed.
>>>>>> hwloc_topology_load() failed (Invalid argument).
>>>>>> -----
>>>>>>
>>>>>> Any idea what's happening here?
>>>>>>
>>>>>> BTW, I can apply the fix to both the OMPI SVN trunk and the v1.7 branch
>>>>>> (since OMPI v1.7 is now up to hwloc 1.7.2).
>>>>>>
>>>>>> On Oct 31, 2013, at 1:28 PM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> Using 1.7.x (1.7.2 and 1.7.3 tested), we get a SIGSEGV from somewhere
>>>>>>> deep inside the 'hwloc' library - see the attached screenshot.
>>>>>>>
>>>>>>> Because the error is tied to one single node, which in turn is a
>>>>>>> rather special one (see the output of 'lstopo -'), it smells like an
>>>>>>> error in the 'hwloc' library.
>>>>>>>
>>>>>>> Is there a way to disable hwloc or to debug it somehow?
>>>>>>> (Besides building a debug version of hwloc and Open MPI.)
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>> --
>>>>>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>> Tel: +49 241/80-24915
>>>>>>>
>>>>>>> <lstopo_linuxitvc00.txt><opal_hwlock_SIGSEGV.png><lstopo_linuxitvc00.xml>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>
>>>>>> _______________________________________________
>>>>>> hwloc-users mailing list
>>>>>> hwloc-us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

--
Jeff Squyres
jsquy...@cisco.com
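A minimal sketch, not part of hwloc or Open MPI, for checking whether a given hwloc release should already contain the XML-reload fix. The only inputs taken from this thread are the fixed-in version (hwloc v1.7.1, per Brice) and the bundled versions (OMPI 1.7.3 ships hwloc 1.5.2, OMPI 1.7.4 will ship hwloc 1.7.2). The helper name `hwloc_has_fix` and the variable `FIXED_IN` are hypothetical, and the script assumes GNU coreutils `sort -V` is available.

```shell
#!/bin/sh
# Hypothetical helper (not from hwloc or Open MPI): report whether an hwloc
# release should already contain the XML-reload cpuset fix.
# Assumption: GNU coreutils `sort -V` (version sort) is available.

# Fixed-in version as stated in this thread (hwloc v1.7.1).
FIXED_IN="1.7.1"

# Returns 0 if version "$1" is >= FIXED_IN, non-zero otherwise.
hwloc_has_fix() {
    # Version-sort both strings; the smaller one comes out first.
    lowest=$(printf '%s\n%s\n' "$1" "$FIXED_IN" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_IN" ]
}

# Open MPI release : bundled hwloc version, as reported in this thread.
for pair in "1.7.3:1.5.2" "1.7.4:1.7.2"; do
    ompi=${pair%%:*}   # text before the colon
    hwloc=${pair#*:}   # text after the colon
    if hwloc_has_fix "$hwloc"; then
        echo "Open MPI $ompi (hwloc $hwloc): fixed"
    else
        echo "Open MPI $ompi (hwloc $hwloc): affected, try HWLOC_GROUPING=0"
    fi
done
```

Version sorting is used instead of a plain string comparison because a string compare would rank "1.10.0" below "1.7.1". For an affected build, Brice's workaround is to export HWLOC_GROUPING=0 before launching; with Open MPI's `mpirun`, the `-x HWLOC_GROUPING` option should forward that variable to processes started on remote nodes.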