You should be able to grab an Open MPI 1.7.x nightly tarball, and it should 
have the newer hwloc that fixes this issue.

Can you give it a whirl and see if it works for you?


On Nov 4, 2013, at 1:49 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Thanks. That's indeed the same bug that you got in Open MPI (reuse of a
> hwloc cpuset structure that was freed earlier). It's a nasty bug that
> happens when reloading from XML on big machines like yours (that
> explains why lstopo works while xmlbuffer and OMPI fail). It was fixed
> in hwloc v1.7.1 (hence will be fixed in Open MPI 1.7.4 from what I
> understand) but the fix was too big to be backported to older hwloc/OMPI.
> 
> You should be able to work around the problem for now by setting
> HWLOC_GROUPING=0 in your environment.
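
A minimal sketch of applying that workaround, assuming a POSIX-style shell. The program name ./my_mpi_app is hypothetical, and forwarding the variable with mpirun's -x option is only my assumption about how you would want it to reach ranks on other nodes; Brice's advice itself is just "set it in your environment".

```shell
# Workaround sketch: disable hwloc's distance-based grouping for this session.
# Assumption: a POSIX-style shell (sh, bash, ksh, zsh).
export HWLOC_GROUPING=0

# Confirm the variable is visible to child processes.
printenv HWLOC_GROUPING

# Hypothetical launch line, left commented out: -x asks Open MPI's mpirun to
# export the named variable to the launched processes.
# mpirun -x HWLOC_GROUPING -np 4 ./my_mpi_app
```

Setting it only for the affected node's jobs, rather than globally in a login profile, makes it easier to drop once a fixed hwloc is in place.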
> 
> I re-added hwloc-users to CC so that the bug is officially "closed".
> 
> Brice
> 
> 
> 
> 
> On 04/11/2013 22:33, Paul Kapinos wrote:
>> Hello again,
>> I'm not allowed to post to the Hardware Locality (hwloc) users list,
>> so I'm leaving it off the CC for now.
>> 
>> On 11/04/13 14:19, Brice Goglin wrote:
>>> On 04/11/2013 11:44, Paul Kapinos wrote:
>>>> Hello all,
>>>> I.
>>>> Sorry for this paleontological excursion. (The four-year-old 'lstopo'
>>>> binary was just sitting in my private bin folder and was still runnable.)
>>>> 
>>>> Attached is the output of the newer version 1.5 (the Linux default on
>>>> RHEL/6.4 (SL/6.4)).
>>>> 
>>>> II.
>>>> I've also tested hwloc-1.5.2 (I could not find v1.5.3) and hwloc-1.7.2
>>>> as Brice suggested, using 'configure' + 'make test' - logs attached.
>>>> 
>>>> 1.5.2 fails:
>>>>> /bin/sh: line 5: 20677 Segmentation fault (core dumped) ${dir}$tst
>>>>> FAIL: xmlbuffer
>>> 
>>> Can you give more details about this segfault?
>>> 
>>> Try (from the build tree):
>>> $ libtool --mode=execute gdb xmlbuffer
>>> then type 'run'
>>> when it crashes, type 'bt full' and send the output.
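
For anyone who prefers to capture the trace without an interactive session, here is a hedged sketch of the same steps in gdb's batch mode. The function name capture_backtrace and the default log file name are hypothetical (not part of hwloc), and it assumes you are in the directory of the build tree that holds the libtool-wrapped xmlbuffer test.

```shell
# Sketch: run a libtool-wrapped test under gdb and save a full backtrace.
# Assumptions: gdb and the build tree's libtool are on PATH; the function
# name and default log name are made up for illustration.
capture_backtrace() {
    test_binary="$1"
    log_file="${2:-backtrace.txt}"
    # -batch makes gdb exit after the listed commands; each -ex runs one
    # gdb command, mirroring the manual 'run' then 'bt full' steps.
    libtool --mode=execute gdb -batch -ex run -ex 'bt full' \
        "$test_binary" > "$log_file" 2>&1
}

# Example invocation (commented out so that sourcing this file is harmless):
# capture_backtrace ./xmlbuffer trace_1.5.2.txt
```

The resulting text file can be attached to a reply as-is.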
>> 
>> see attached file trace_1.5.2.txt
>> 
>> 
>> 
>> 
>> 
>>> 
>>> Then please also run from hwloc 1.5.2:
>>> * "lstopo foo.xml" and send "foo.xml"
>>> * "hwloc-gather-topology foo" and send "foo.tar.bz2"
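
The two requests above can be wrapped into one small helper. This is only a sketch: the function name collect_topology is hypothetical, and it assumes the lstopo and hwloc-gather-topology from the hwloc build you want to test are the ones found first on PATH.

```shell
# Sketch: gather the two artifacts requested above under one base name.
# Assumption: lstopo and hwloc-gather-topology on PATH come from the hwloc
# version under test. "collect_topology" is a made-up helper name.
collect_topology() {
    base_name="$1"
    # Export the topology as XML; lstopo picks the format from the extension.
    lstopo "$base_name.xml" || return 1
    # Archive the raw topology data; per the request above this should
    # leave "$base_name.tar.bz2" behind.
    hwloc-gather-topology "$base_name"
}

# Example invocation (commented out): collect_topology linuxitvc00_1.5.2
```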
>> 
>> also attached but with non-empty names :o)
>> 
>> 
>> 
>> Best
>> 
>> Paul
>>> 
>>>> whereas 1.7.2 seems to be OK.
>>>> 
>>>> AFAIK the bundled 'hwloc' is to be updated in Open MPI 1.7.4?
>>>> If so, that should fix the original issue, right?
>>> 
>>> Hard to say before we get details about the crash in xmlbuffer above.
>>> 
>>> Brice
>>> 
>>> 
>>>> 
>>>> Many thanks for your help!
>>>> Best
>>>> 
>>>> Paul
>>>> 
>>>> pk224850@linuxitvc00:~/SVN/mpifasttest/trunk[511]lstopo 1.5
>>>> $ lstopo lstopo_linuxitvc00_1.5.txt
>>>> $ lstopo lstopo_linuxitvc00_1.5.xml
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 11/01/13 15:37, Brice Goglin wrote:
>>>>> Sorry, I missed the mail on OMPI-users.
>>>>> 
>>>>> This hwloc looks veeeeeeeeeeeery old. We haven't had Misc objects
>>>>> instead of Groups since we switched from 0.9 to 1.0. You should
>>>>> regenerate the XML file with a hwloc version that came out after the
>>>>> big bang (or better, after the asteroid killed the dinosaurs). Please
>>>>> resend that XML from a recent hwloc so that we can get a better clue
>>>>> of the problem.
>>>>> 
>>>>> Assuming there's a bug in OMPI's hwloc, I would suggest downloading
>>>>> hwloc 1.5.3 and running make check on that machine. And try again
>>>>> with hwloc 1.7.2 in case that's already fixed.
>>>>> 
>>>>> thanks
>>>>> Brice
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 01/11/2013 15:24, Jeff Squyres (jsquyres) wrote:
>>>>>> Paul Kapinos originally reported this issue on the OMPI users list.
>>>>>> 
>>>>>> He is showing a stack trace from OMPI-1.7.3, which uses hwloc 1.5.2
>>>>>> (note that OMPI 1.7.4 will use hwloc 1.7.2).
>>>>>> 
>>>>>> I tried to read the XML file he provided with the git hwloc master
>>>>>> HEAD, and it fails:
>>>>>> 
>>>>>> -----
>>>>>> ❯❯❯ ./utils/lstopo -i lstopo_linuxitvc00.xml
>>>>>> ignoring depth attribute for object type without depth
>>>>>> ignoring depth attribute for object type without depth
>>>>>> XML component discovery failed.
>>>>>> hwloc_topology_load() failed (Invalid argument).
>>>>>> -----
>>>>>> 
>>>>>> Any idea what's happening here?
>>>>>> 
>>>>>> BTW, I can apply the fix to both the OMPI SVN trunk and v1.7 branch
>>>>>> (since OMPI v1.7 is now up to hwloc 1.7.2).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Oct 31, 2013, at 1:28 PM, Paul Kapinos
>>>>>> <kapi...@rz.rwth-aachen.de> wrote:
>>>>>> 
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> using 1.7.x (1.7.2 and 1.7.3 tested), we get a SIGSEGV from somewhere
>>>>>>> deep inside the 'hwloc' library - see the attached screenshot.
>>>>>>> 
>>>>>>> Because the error is tied to just one single node, which in turn is
>>>>>>> a rather special one (see output of 'lstopo -'), it smells like an
>>>>>>> error in the 'hwloc' library.
>>>>>>> 
>>>>>>> Is there a way to disable hwloc or to debug it somehow?
>>>>>>> (besides building a debug version of hwloc and Open MPI)
>>>>>>> 
>>>>>>> Best
>>>>>>> 
>>>>>>> Paul
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>> Tel: +49 241/80-24915
>>>>>>> 
>>>>>> <lstopo_linuxitvc00.txt><opal_hwlock_SIGSEGV.png><lstopo_linuxitvc00.xml>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> hwloc-users mailing list
>>>>>> hwloc-us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
