Ralph,

You are right, please disregard my previous post, it was irrelevant.

I just noticed that, unlike OMPI v1.8 (hwloc 1.7.2 based, so no warning), master has this warning (hwloc 1.9.1). I will build SLURM against a recent hwloc and see what happens. (FWIW, RHEL6 comes with hwloc 1.5, RHEL7 comes with hwloc 1.7, and neither of them shows this warning.)
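For what it is worth, something along these lines should show which hwloc library SLURM is actually linked against and which lstopo each node picks up (a quick sketch only; the slurmd path is illustrative and depends on the install):

    # which hwloc shared library is slurmd dynamically linked against?
    # (path illustrative; no output may simply mean a static link)
    ldd /usr/sbin/slurmd | grep hwloc
    # which lstopo/hwloc version does each of the two nodes report?
    srun -N 2 lstopo --version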
Cheers,

Gilles

On 2014/12/11 12:02, Ralph Castain wrote:
> Per his prior notes, he is using mpirun to launch his jobs. Brice has confirmed that OMPI doesn't have that hwloc warning in it. So either he has inadvertently linked against the Ubuntu system version of hwloc, or the message must be coming from Slurm.
>
>> On Dec 10, 2014, at 6:14 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> Pim,
>>
>> at this stage, all I can do is acknowledge your slurm is configured to use cgroups.
>>
>> and based on your previous comment (e.g. the problem only occurs with several jobs on the same node), that *could* be a bug in OpenMPI (or hwloc).
>>
>> by the way, how do you start your mpi application?
>> - do you use mpirun?
>> - do you use srun --resv-ports?
>>
>> I'll try to reproduce this in my test environment.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/11 2:45, Pim Schellart wrote:
>>> Dear Gilles et al.,
>>>
>>> we tested with openmpi compiled from source (version 1.8.3) both with:
>>>
>>> ./configure --prefix=/usr/local/openmpi --disable-silent-rules --with-libltdl=external --with-devel-headers --with-slurm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi
>>>
>>> and
>>>
>>> ./configure --prefix=/usr/local/openmpi --with-hwloc=/usr --disable-silent-rules --with-libltdl=external --with-devel-headers --with-slurm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi
>>>
>>> (i.e. with embedded and external hwloc) and the issue remains the same. Meanwhile we have found another interesting detail. A job is started consisting of four tasks split over two nodes. If this is the only job running on those nodes, the out-of-order warnings do not appear. However, if multiple jobs are running, the warnings do appear, but only for the jobs that are started later. We suspect that this is because for the first started job the CPU cores assigned are 0 and 1, whereas they are different for the later started jobs. I attached the output (including lstopo --of xml output, called for each task) for both the working and broken case again.
>>>
>>> Kind regards,
>>>
>>> Pim Schellart
>>>
>>>> On 09 Dec 2014, at 09:38, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>> Pim,
>>>>
>>>> if you configure OpenMPI with --with-hwloc=external (or something like --with-hwloc=/usr), it is very likely OpenMPI will use the same hwloc library (i.e. the "system" library) that is used by SLURM.
>>>>
>>>> /* I do not know how Ubuntu packages OpenMPI ... */
>>>>
>>>> The default (i.e. no --with-hwloc parameter on the configure command line) is to use the hwloc library that is embedded within OpenMPI.
>>>>
>>>> Gilles
>>>>
>>>> On 2014/12/09 17:34, Pim Schellart wrote:
>>>>> Ah, ok, so that was where the confusion came from. I did see hwloc in the SLURM sources but couldn't immediately figure out where exactly it was used. We will try compiling openmpi with the embedded hwloc. Any particular flags I should set?
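(As Gilles notes above, the embedded hwloc is selected simply by omitting --with-hwloc from configure, so no extra flag is needed for it; a minimal sketch, where the prefix is illustrative and any of the other options from your earlier configure lines can be added back as required:)

    ./configure --prefix=/usr/local/openmpi --with-slurm
    make && make install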
>>>>>
>>>>>> On 09 Dec 2014, at 09:30, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> There is no linkage between slurm and ompi when it comes to hwloc. If you directly launch your app using srun, then slurm will use its version of hwloc to do the binding. If you use mpirun to launch the app, then we'll use our internal version to do it.
>>>>>>
>>>>>> The two are completely isolated from each other.
>>>>>>
>>>>>>> On Dec 9, 2014, at 12:25 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>>>
>>>>>>> The version that "lstopo --version" reports is the same (1.8) on all nodes, but we may indeed be hitting the second issue. We can try to compile a new version of openmpi, but how do we ensure that the external programs (e.g. SLURM) are using the same hwloc version as the one embedded in openmpi? Is it enough to just compile hwloc 1.9 separately as well and link against that? Also, if this is an issue, should we file a bug against hwloc or openmpi on Ubuntu for mismatching versions?
>>>>>>>
>>>>>>>> On 09 Dec 2014, at 00:50, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> Hmmm... they probably linked that to the external, system hwloc version, so it sounds like one or more of your nodes has a different hwloc rpm on it.
>>>>>>>>
>>>>>>>> I couldn't leaf thru your output well enough to see all the lstopo versions, but you might check to ensure they are the same.
>>>>>>>>
>>>>>>>> Looking at the code base, you may also hit a problem here. The OMPI 1.6 series was based on hwloc 1.3 - the output you sent indicated you have hwloc 1.8, which is quite a big change. The OMPI 1.8 series is based on hwloc 1.9, so at least that is closer (though probably still a mismatch).
>>>>>>>>
>>>>>>>> Frankly, I'd just download and install an OMPI tarball myself and avoid these headaches. This mismatch in required versions is why we embed hwloc: it is a critical library for OMPI, and we had to ensure that the version matched our internal requirements.
>>>>>>>>
>>>>>>>>> On Dec 8, 2014, at 8:50 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> It is the default openmpi that comes with Ubuntu 14.04.
>>>>>>>>>
>>>>>>>>>> On 08 Dec 2014, at 17:17, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Pim: is this an OMPI you built, or one you were given somehow? If you built it, how did you configure it?
>>>>>>>>>>
>>>>>>>>>>> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> It likely depends on how SLURM allocates the cpuset/cgroup inside the nodes. The XML warning is related to these restrictions inside the node. Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
>>>>>>>>>>>
>>>>>>>>>>> How do we check after install whether OMPI uses the embedded or the system-wide hwloc?
>>>>>>>>>>>
>>>>>>>>>>> Brice
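One way to probe Brice's question, sketched under the assumption of a default shared-library install (the library path below is illustrative): an OMPI built against an external hwloc picks up a runtime dependency on libhwloc, whereas the embedded copy is compiled into OMPI's own libraries, so no such dependency shows up.

    # a libhwloc.so line here suggests the external/system hwloc is in use;
    # no output suggests the embedded copy (adjust the path to your install prefix)
    ldd /usr/local/openmpi/lib/libmpi.so | grep hwloc
    # ompi_info's component/version listing can also give a hint
    ompi_info | grep -i hwloc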
>>>>>>>>>>>
>>>>>>>>>>> On 08/12/2014 17:07, Pim Schellart wrote:
>>>>>>>>>>>> Dear Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> the nodes are called coma## and, as you can see in the logs, the nodes of the broken example are the same as the nodes of the working one, so that doesn't seem to be the cause. Unless (very likely) I'm missing something. Anything else I can check?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Pim
>>>>>>>>>>>>
>>>>>>>>>>>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Brice said, OMPI has its own embedded version of hwloc that we use, so there is no Slurm interaction to be considered. The most likely cause is that one or more of your nodes is picking up a different version of OMPI. So things "work" if you happen to get nodes where all the versions match, and "fail" when you get a combination that includes a different version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there some way you can narrow down your search to find the node(s) that are picking up the different version?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Brice,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not sure why this is happening, since all code seems to be using the same hwloc library version (1.8), but it does :) An MPI program is started through SLURM on two nodes with four CPU cores total (divided over the nodes) using the following script:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #! /bin/bash
>>>>>>>>>>>>>> #SBATCH -N 2 -n 4
>>>>>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>>>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>>>>>>>>>>>> /usr/bin/mpiexec /path/to/my_mpi_code
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When this is submitted multiple times it gives "out-of-order" warnings in about 9/10 cases but works without warnings in 1/10 cases. I attached the output (with xml) for both the working and `broken` case. Note that the xml is of course printed (differently) multiple times for each task/core. As always, any help would be appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pim Schellart
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> P.S. $ mpirun --version
>>>>>>>>>>>>>> mpirun (Open MPI) 1.6.5
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <broken.log><working.log>
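Ralph's question above, about finding which node picks up a different version, could be probed with a small variation of that batch script. This is only a sketch (the binary paths simply mirror the script above) that prints, for every task, the node name plus the Open MPI and lstopo/hwloc versions:

    #! /bin/bash
    #SBATCH -N 2 -n 4
    # one line per task: hostname, Open MPI version, lstopo/hwloc version
    /usr/bin/mpiexec bash -c 'echo "$(hostname): $(mpirun --version 2>&1 | head -n 1), $(lstopo --version)"'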
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>> The github issue you're referring to was closed 18 months ago. The warning (it's not an error) is only supposed to appear if you're importing into a recent hwloc an XML that was exported from an old hwloc. I don't see how that could happen when using Open MPI, since the hwloc versions on both sides are the same.
>>>>>>>>>>>>>>> Make sure you're not confusing it with another error described here:
>>>>>>>>>>>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>>>>>>>>>>>> Otherwise please report the exact Open MPI and/or hwloc versions, as well as the XML lstopo output on the nodes that raise the warning (lstopo foo.xml). Send these to the hwloc mailing lists, such as hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Brice
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>>>>>>>>>>>> Dear OpenMPI developers,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> this might be a bit off topic, but when using the SLURM scheduler (with cpuset support) on Ubuntu 14.04 (openmpi 1.6), hwloc sometimes gives an "out-of-order topology discovery" error. According to issue #103 on github (https://github.com/open-mpi/hwloc/issues/103), this error was discussed before and it was possible to sort it out in "insert_object_by_parent"; is this still considered? If not, what (top level) hwloc API call should we look for in the SLURM sources to start debugging? Any help will be most welcome.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pim Schellart