Ah, OK, so that is where the confusion came from: I did see hwloc in the SLURM
sources but couldn't immediately figure out where exactly it was used. We will
try compiling Open MPI with the embedded hwloc. Are there any particular flags
I should set?
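
For reference, a minimal sketch of the configure line we had in mind (the
version number and install prefix are just examples, and we are assuming the
--with-hwloc=internal option behaves the same way in current releases):

  # build Open MPI with its embedded hwloc copy and SLURM support
  tar xjf openmpi-1.8.3.tar.bz2 && cd openmpi-1.8.3
  ./configure --prefix=$HOME/opt/openmpi-1.8 \
              --with-hwloc=internal \
              --with-slurm
  make -j4 && make install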

> On 09 Dec 2014, at 09:30, Ralph Castain <r...@open-mpi.org> wrote:
> 
> There is no linkage between slurm and ompi when it comes to hwloc. If you 
> directly launch your app using srun, then slurm will use its version of hwloc 
> to do the binding. If you use mpirun to launch the app, then we’ll use our 
> internal version to do it.
> 
> The two are completely isolated from each other.
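> 
> To make that concrete, a rough sketch of the two launch paths (the binding
> flags below are the usual slurm and OMPI 1.8 spellings; the 1.6 series uses
> --bind-to-core instead of --bind-to core):
> 
>   # direct launch: slurm's own hwloc does the binding
>   srun -N 2 -n 4 --cpu_bind=cores ./my_mpi_code
> 
>   # launch via mpirun: OMPI's embedded hwloc does the binding
>   salloc -N 2 -n 4 mpirun --bind-to core ./my_mpi_code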
> 
> 
>> On Dec 9, 2014, at 12:25 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>> 
>> The version that “lstopo --version” reports is the same (1.8) on all nodes, 
>> but we may indeed be hitting the second issue. We can try to compile a new 
>> version of openmpi, but how do we ensure that the external programs (e.g. 
>> SLURM) are using the same hwloc version as the one embedded in openmpi? Is 
>> it enough to just compile hwloc 1.9 separately as well and link against 
>> that? Also, if this is an issue, should we file a bug against hwloc or 
>> openmpi on Ubuntu for mismatching versions?
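>> 
>> For the record, what we had in mind was roughly the following (the paths are
>> placeholders, and the --with-hwloc flags are assumptions based on each
>> package's configure help, not something we have tested):
>> 
>>   # build hwloc 1.9 once into a common prefix
>>   ./configure --prefix=/opt/hwloc-1.9 && make && make install
>> 
>>   # then point both packages at that same copy when building them
>>   slurm:     ./configure --with-hwloc=/opt/hwloc-1.9 ...
>>   openmpi:   ./configure --with-hwloc=/opt/hwloc-1.9 ...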
>> 
>>> On 09 Dec 2014, at 00:50, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> Hmmm… they probably linked that against the external, system-wide hwloc, so
>>> it sounds like one or more of your nodes has a different hwloc package on it.
>>> 
>>> I couldn't leaf through your output well enough to see all the lstopo
>>> versions, but you might check to ensure they are the same.
>>> 
>>> Looking at the code base, you may also hit a problem here. The OMPI 1.6
>>> series was based on hwloc 1.3, while the output you sent indicates you have
>>> hwloc 1.8, which is quite a big change. The OMPI 1.8 series is based on
>>> hwloc 1.9, so at least that is closer (though probably still a mismatch).
>>> 
>>> Frankly, I'd just download and install an OMPI tarball myself and avoid
>>> these headaches. This kind of version mismatch is why we embed hwloc: it is
>>> a critical library for OMPI, and we had to ensure that the version matched
>>> our internal requirements.
>>> 
>>> 
>>>> On Dec 8, 2014, at 8:50 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>> 
>>>> It is the default openmpi that comes with Ubuntu 14.04.
>>>> 
>>>>> On 08 Dec 2014, at 17:17, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>> Pim: is this an OMPI you built, or one you were given somehow? If you 
>>>>> built it, how did you configure it?
>>>>> 
>>>>>> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>> 
>>>>>> It likely depends on how SLURM allocates the cpuset/cgroup inside the
>>>>>> nodes. The XML warning is related to these restrictions inside the node.
>>>>>> Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
>>>>>> 
>>>>>> How do we check after install whether OMPI uses the embedded or the
>>>>>> system-wide hwloc?
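>>>>>> 
>>>>>> E.g. would checks along these lines be reliable (these are just guesses
>>>>>> on my side):
>>>>>> 
>>>>>>   # an external hwloc should show up as a shared-library dependency
>>>>>>   ldd $(which mpirun) | grep hwloc
>>>>>> 
>>>>>>   # the embedded copy should instead appear as an hwloc MCA component
>>>>>>   ompi_info | grep -i hwloc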
>>>>>> 
>>>>>> Brice
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 08/12/2014 17:07, Pim Schellart wrote:
>>>>>>> Dear Ralph,
>>>>>>> 
>>>>>>> the nodes are called coma##, and as you can see in the logs, the nodes
>>>>>>> of the broken example are the same as the nodes of the working one, so
>>>>>>> that doesn't seem to be the cause. Unless (very likely) I'm missing
>>>>>>> something. Anything else I can check?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> Pim
>>>>>>> 
>>>>>>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> 
>>>>>>>> As Brice said, OMPI has its own embedded version of hwloc that we use, 
>>>>>>>> so there is no Slurm interaction to be considered. The most likely 
>>>>>>>> cause is that one or more of your nodes is picking up a different 
>>>>>>>> version of OMPI. So things “work” if you happen to get nodes where all 
>>>>>>>> the versions match, and “fail” when you get a combination that 
>>>>>>>> includes a different version.
>>>>>>>> 
>>>>>>>> Is there some way you can narrow down your search to find the node(s) 
>>>>>>>> that are picking up the different version?
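>>>>>>>> 
>>>>>>>> For example, something along these lines run across an allocation would
>>>>>>>> show it at a glance (it only reuses commands already mentioned in this
>>>>>>>> thread):
>>>>>>>> 
>>>>>>>>   # print hostname, Open MPI version and lstopo version for each node
>>>>>>>>   srun -N 2 sh -c 'echo "$(hostname): $(mpirun --version 2>&1 | head -1) / $(lstopo --version)"'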
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Dear Brice,
>>>>>>>>> 
>>>>>>>>> I am not sure why this is happening, since all code seems to be using
>>>>>>>>> the same hwloc library version (1.8), but it does :) An MPI program is
>>>>>>>>> started through SLURM on two nodes with four CPU cores in total
>>>>>>>>> (divided over the nodes) using the following script:
>>>>>>>>> 
>>>>>>>>> #! /bin/bash
>>>>>>>>> #SBATCH -N 2 -n 4
>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>>>>>>> /usr/bin/mpiexec  /path/to/my_mpi_code
>>>>>>>>> 
>>>>>>>>> When this is submitted multiple times, it gives "out-of-order"
>>>>>>>>> warnings in roughly 9 out of 10 cases but works without warnings in
>>>>>>>>> the remaining ones. I attached the output (with XML) for both the
>>>>>>>>> working and the broken case. Note that the XML is of course printed
>>>>>>>>> (differently) multiple times, once for each task/core. As always, any
>>>>>>>>> help would be appreciated.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> 
>>>>>>>>> Pim Schellart
>>>>>>>>> 
>>>>>>>>> P.S. $ mpirun --version
>>>>>>>>> mpirun (Open MPI) 1.6.5
>>>>>>>>> 
>>>>>>>>> <broken.log><working.log>
>>>>>>>>> 
>>>>>>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hello
>>>>>>>>>> The GitHub issue you're referring to was closed 18 months ago. The
>>>>>>>>>> warning (it's not an error) is only supposed to appear if you're
>>>>>>>>>> importing into a recent hwloc an XML that was exported from an old
>>>>>>>>>> hwloc. I don't see how that could happen when using Open MPI, since
>>>>>>>>>> the hwloc versions on both sides are the same.
>>>>>>>>>> Make sure you're not confusing it with another error described here:
>>>>>>>>>> 
>>>>>>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>>>>>>> 
>>>>>>>>>> Otherwise, please report the exact Open MPI and/or hwloc versions as
>>>>>>>>>> well as the XML lstopo output on the nodes that raise the warning
>>>>>>>>>> (lstopo foo.xml). Send these to one of the hwloc mailing lists, such
>>>>>>>>>> as hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org.
>>>>>>>>>> Thanks
>>>>>>>>>> Brice
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>>>>>>> Dear OpenMPI developers,
>>>>>>>>>>> 
>>>>>>>>>>> this might be a bit off topic, but when using the SLURM scheduler
>>>>>>>>>>> (with cpuset support) on Ubuntu 14.04 (openmpi 1.6), hwloc sometimes
>>>>>>>>>>> gives an "out-of-order topology discovery" error. According to issue
>>>>>>>>>>> #103 on GitHub (https://github.com/open-mpi/hwloc/issues/103), this
>>>>>>>>>>> error was discussed before and it was suggested that it could be
>>>>>>>>>>> sorted out in "insert_object_by_parent"; is that still being
>>>>>>>>>>> considered? If not, what (top-level) hwloc API call should we look
>>>>>>>>>>> for in the SLURM sources to start debugging? Any help will be most
>>>>>>>>>>> welcome.
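>>>>>>>>>>> 
>>>>>>>>>>> We were going to start with a grep along these lines; the function
>>>>>>>>>>> names below are just ones picked from the public hwloc API and may
>>>>>>>>>>> well not be what SLURM actually uses:
>>>>>>>>>>> 
>>>>>>>>>>>   grep -rnE 'hwloc_topology_(init|load|set_flags)|hwloc_set_cpubind' slurm-*/src/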
>>>>>>>>>>> 
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> 
>>>>>>>>>>> Pim Schellart
