Ralph,

You are right; please disregard my previous post, it was irrelevant.

I just noticed that, unlike OMPI v1.8 (based on hwloc 1.7.2 => no warning),
master (based on hwloc 1.9.1) does produce this warning.

I will build SLURM against a recent hwloc and see what happens.
(FWIW, RHEL6 ships hwloc 1.5 and RHEL7 ships hwloc 1.7, and neither of them
produces this warning.)
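
Roughly what I plan to try (just a sketch; the prefix and versions are
placeholders, and I am assuming SLURM's configure accepts --with-hwloc=PATH):

# build a recent hwloc into its own prefix
cd hwloc-1.9.1
./configure --prefix=/opt/hwloc-1.9.1 && make && make install

# then point the SLURM build at that hwloc
cd ../slurm-<version>
./configure --with-hwloc=/opt/hwloc-1.9.1 && make && make install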

Cheers,

Gilles

On 2014/12/11 12:02, Ralph Castain wrote:
> Per his prior notes, he is using mpirun to launch his jobs. Brice has 
> confirmed that OMPI doesn't have that hwloc warning in it. So either he has 
> inadvertently linked against the Ubuntu system version of hwloc, or the 
> message must be coming from Slurm.
>
>
>> On Dec 10, 2014, at 6:14 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> Pim,
>>
>> At this stage, all I can do is confirm that your SLURM is configured to use
>> cgroups.
>>
>> Based on your previous comment (i.e. the problem only occurs with several
>> jobs on the same node), this *could* be a bug in OpenMPI (or hwloc).
>>
>> By the way, how do you start your MPI application?
>> - do you use mpirun?
>> - do you use srun --resv-ports?
>>
>> I'll try to reproduce this in my test environment.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/11 2:45, Pim Schellart wrote:
>>> Dear Gilles et al.,
>>>
>>> We tested with OpenMPI compiled from source (version 1.8.3), both with:
>>>
>>> ./configure --prefix=/usr/local/openmpi --disable-silent-rules 
>>> --with-libltdl=external --with-devel-headers --with-slurm 
>>> --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi
>>>
>>> and
>>>
>>> ./configure --prefix=/usr/local/openmpi --with-hwloc=/usr 
>>> --disable-silent-rules --with-libltdl=external --with-devel-headers 
>>> --with-slurm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi
>>>
>>> (i.e. with embedded and with external hwloc), and the issue remains the same.
>>> Meanwhile we have found another interesting detail. A job is started
>>> consisting of four tasks split over two nodes. If this is the only job
>>> running on those nodes, the out-of-order warnings do not appear. However, if
>>> multiple jobs are running, the warnings do appear, but only for the jobs that
>>> are started later. We suspect this is because the first-started job is
>>> assigned CPU cores 0 and 1, whereas the later-started jobs are assigned
>>> different cores. I attached the output (including the lstopo --of xml
>>> output, called for each task) for both the working and the broken case again.
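>>>
>>> For reference, a quick way to check which hwloc each build actually uses
>>> (just a sketch, assuming the install prefix from the configure lines above
>>> and that ldd and ompi_info are available):
>>>
>>> # an external-hwloc build lists libhwloc.so here; an embedded build does not
>>> ldd /usr/local/openmpi/lib/libmpi.so | grep hwloc
>>> # shows which hwloc component OpenMPI selected
>>> /usr/local/openmpi/bin/ompi_info | grep -i hwloc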
>>>
>>> Kind regards,
>>>
>>> Pim Schellart
>>>
>>>
>>>
>>>
>>>> On 09 Dec 2014, at 09:38, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>> Pim,
>>>>
>>>> If you configure OpenMPI with --with-hwloc=external (or something like
>>>> --with-hwloc=/usr), it is very likely that OpenMPI will use the same hwloc
>>>> library (i.e. the "system" library) that is used by SLURM.
>>>>
>>>> /* I do not know how Ubuntu packages OpenMPI ... */
>>>>
>>>>
>>>> The default (i.e. no --with-hwloc parameter on the configure command
>>>> line) is to use the hwloc library that is embedded within OpenMPI.
>>>>
>>>> Gilles
>>>>
>>>> On 2014/12/09 17:34, Pim Schellart wrote:
>>>>> Ah, OK, so that is where the confusion came from: I did see hwloc in the
>>>>> SLURM sources but couldn't immediately figure out where exactly it was
>>>>> used. We will try compiling OpenMPI with the embedded hwloc. Any
>>>>> particular flags I should set?
>>>>>
>>>>>> On 09 Dec 2014, at 09:30, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> There is no linkage between slurm and ompi when it comes to hwloc. If 
>>>>>> you directly launch your app using srun, then slurm will use its version 
>>>>>> of hwloc to do the binding. If you use mpirun to launch the app, then 
>>>>>> we'll use our internal version to do it.
>>>>>>
>>>>>> The two are completely isolated from each other.
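>>>>>>
>>>>>> Concretely, the two launch modes look something like this (a sketch; the
>>>>>> task count and binary name are placeholders):
>>>>>>
>>>>>> mpirun -np 4 ./my_mpi_code            # OMPI binds with its internal hwloc
>>>>>> srun --resv-ports -n 4 ./my_mpi_code  # Slurm binds with its own hwloc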
>>>>>>
>>>>>>
>>>>>>> On Dec 9, 2014, at 12:25 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>>>
>>>>>>> The version that "lstopo --version" reports is the same (1.8) on all 
>>>>>>> nodes, but we may indeed be hitting the second issue. We can try to 
>>>>>>> compile a new version of openmpi, but how do we ensure that the 
>>>>>>> external programs (e.g. SLURM) are using the same hwloc version as the 
>>>>>>> one embedded in openmpi? Is it enough to just compile hwloc 1.9 
>>>>>>> separately as well and link against that? Also, if this is an issue, 
>>>>>>> should we file a bug against hwloc or openmpi on Ubuntu for mismatching 
>>>>>>> versions?
>>>>>>>
>>>>>>>> On 09 Dec 2014, at 00:50, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> Hmmm...they probably linked that to the external, system hwloc 
>>>>>>>> version, so it sounds like one or more of your nodes has a different 
>>>>>>>> hwloc rpm on it.
>>>>>>>>
>>>>>>>> I couldn't leaf through your output well enough to see all the lstopo
>>>>>>>> versions, but you might check to ensure they are the same.
>>>>>>>>
>>>>>>>> Looking at the code base, you may also hit a problem here. OMPI 1.6 
>>>>>>>> series was based on hwloc 1.3 - the output you sent indicated you have 
>>>>>>>> hwloc 1.8, which is quite a big change. OMPI 1.8 series is based on 
>>>>>>>> hwloc 1.9, so at least that is closer (though probably still a 
>>>>>>>> mismatch).
>>>>>>>>
>>>>>>>> Frankly, I'd just download and install an OMPI tarball myself and 
>>>>>>>> avoid these headaches. This mismatch in required versions is why we 
>>>>>>>> embed hwloc as it is a critical library for OMPI, and we had to ensure 
>>>>>>>> that the version matched our internal requirements.
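>>>>>>>>
>>>>>>>> For example (a sketch only; the version and install prefix are placeholders):
>>>>>>>>
>>>>>>>> tar xjf openmpi-1.8.3.tar.bz2
>>>>>>>> cd openmpi-1.8.3
>>>>>>>> ./configure --prefix=$HOME/openmpi --with-slurm
>>>>>>>> make -j 4 all install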
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Dec 8, 2014, at 8:50 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> It is the default openmpi that comes with Ubuntu 14.04.
>>>>>>>>>
>>>>>>>>>> On 08 Dec 2014, at 17:17, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Pim: is this an OMPI you built, or one you were given somehow? If 
>>>>>>>>>> you built it, how did you configure it?
>>>>>>>>>>
>>>>>>>>>>> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> It likely depends on how SLURM allocates the cpuset/cgroup inside 
>>>>>>>>>>> the
>>>>>>>>>>> nodes. The XML warning is related to these restrictions inside the 
>>>>>>>>>>> node.
>>>>>>>>>>> Anyway, my feeling is that there's an old OMPI or an old hwloc
>>>>>>>>>>> somewhere.
>>>>>>>>>>>
>>>>>>>>>>> How do we check after install whether OMPI uses the embedded or the
>>>>>>>>>>> system-wide hwloc?
>>>>>>>>>>>
>>>>>>>>>>> Brice
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 08/12/2014 17:07, Pim Schellart wrote:
>>>>>>>>>>>> Dear Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> The nodes are called coma##, and as you can see in the logs, the
>>>>>>>>>>>> nodes of the broken example are the same as the nodes of the
>>>>>>>>>>>> working one, so that doesn't seem to be the cause. Unless (very
>>>>>>>>>>>> likely) I'm missing something. Anything else I can check?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Pim
>>>>>>>>>>>>
>>>>>>>>>>>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Brice said, OMPI has its own embedded version of hwloc that we 
>>>>>>>>>>>>> use, so there is no Slurm interaction to be considered. The most 
>>>>>>>>>>>>> likely cause is that one or more of your nodes is picking up a 
>>>>>>>>>>>>> different version of OMPI. So things "work" if you happen to get 
>>>>>>>>>>>>> nodes where all the versions match, and "fail" when you get a 
>>>>>>>>>>>>> combination that includes a different version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there some way you can narrow down your search to find the 
>>>>>>>>>>>>> node(s) that are picking up the different version?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart
>>>>>>>>>>>>>> <p.schell...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Brice,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not sure why this is happening, since all code seems to be
>>>>>>>>>>>>>> using the same hwloc library version (1.8), but it does :) An MPI
>>>>>>>>>>>>>> program is started through SLURM on two nodes, with four CPU
>>>>>>>>>>>>>> cores in total (divided over the nodes), using the following script:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #! /bin/bash
>>>>>>>>>>>>>> #SBATCH -N 2 -n 4
>>>>>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>>>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>>>>>>>>>>>> /usr/bin/mpiexec  /path/to/my_mpi_code
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When this is submitted multiple times it gives "out-of-order" 
>>>>>>>>>>>>>> warnings in about 9/10 cases but works without warnings in 1/10 
>>>>>>>>>>>>>> cases. I attached the output (with xml) for both the working and 
>>>>>>>>>>>>>> `broken` case. Note that the xml is of course printed 
>>>>>>>>>>>>>> (differently) multiple times for each task/core. As always, any 
>>>>>>>>>>>>>> help would be appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pim Schellart
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> P.S. $ mpirun --version
>>>>>>>>>>>>>> mpirun (Open MPI) 1.6.5
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <broken.log><working.log>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello
>>>>>>>>>>>>>>> The GitHub issue you're referring to was closed 18 months ago. The
>>>>>>>>>>>>>>> warning (it's not an error) is only supposed to appear if you're
>>>>>>>>>>>>>>> importing into a recent hwloc an XML that was exported from an old
>>>>>>>>>>>>>>> hwloc. I don't see how that could happen when using Open MPI, since
>>>>>>>>>>>>>>> the hwloc versions on both sides are the same.
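>>>>>>>>>>>>>>> For illustration, that import path can be exercised by hand with two
>>>>>>>>>>>>>>> hwloc installs (a sketch only; the install paths are placeholders, and
>>>>>>>>>>>>>>> whether the warning appears depends on the topology):
>>>>>>>>>>>>>>> /opt/hwloc-1.5/bin/lstopo foo.xml          # export XML with the old hwloc
>>>>>>>>>>>>>>> /opt/hwloc-1.9/bin/lstopo --input foo.xml  # load that XML with a recent hwloc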
>>>>>>>>>>>>>>> Make sure you're not confusing it with another error described here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>>>>>>>>>>>> Otherwise please report the exact Open MPI and/or hwloc 
>>>>>>>>>>>>>>> versions as well
>>>>>>>>>>>>>>> as the XML lstopo output on the nodes that raise the warning 
>>>>>>>>>>>>>>> (lstopo
>>>>>>>>>>>>>>> foo.xml). Send these to the hwloc mailing lists, such as
>>>>>>>>>>>>>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org.
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Brice
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>>>>>>>>>>>> Dear OpenMPI developers,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This might be a bit off topic, but when using the SLURM
>>>>>>>>>>>>>>>> scheduler (with cpuset support) on Ubuntu 14.04 (openmpi 1.6),
>>>>>>>>>>>>>>>> hwloc sometimes gives an "out-of-order topology discovery"
>>>>>>>>>>>>>>>> error. According to issue #103 on GitHub
>>>>>>>>>>>>>>>> (https://github.com/open-mpi/hwloc/issues/103), this error was
>>>>>>>>>>>>>>>> discussed before and it was possible to sort it out in
>>>>>>>>>>>>>>>> "insert_object_by_parent". Is this still being considered? If not,
>>>>>>>>>>>>>>>> what (top-level) hwloc API call should we look for in the
>>>>>>>>>>>>>>>> SLURM sources to start debugging? Any help will be most
>>>>>>>>>>>>>>>> welcome.
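>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As a starting point, grepping the SLURM tree for the top-level
>>>>>>>>>>>>>>>> topology calls might help (a sketch; the source directory is a
>>>>>>>>>>>>>>>> placeholder):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> grep -rn -e hwloc_topology_init -e hwloc_topology_load \
>>>>>>>>>>>>>>>>     -e hwloc_topology_set_xml slurm-*/src/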
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pim Schellart
>>>
