Ah, ok, so that is where the confusion came from: I did see hwloc in the SLURM sources but couldn’t immediately figure out where exactly it was used. We will try compiling Open MPI with the embedded hwloc. Any particular flags I should set?
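Something along these lines is what we had in mind (untested sketch; the tarball version, install prefix, and job width are placeholders, and we are assuming the configure options behave as documented for the 1.8 series):

# rough sketch of the planned build; version and prefix are placeholders
tar xjf openmpi-1.8.3.tar.bz2
cd openmpi-1.8.3
./configure --prefix=/opt/openmpi-1.8 \
            --with-hwloc=internal \
            --with-slurm
make -j4 && make install

If that looks reasonable we will point the batch scripts at the new prefix and rerun the test job.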
> On 09 Dec 2014, at 09:30, Ralph Castain <r...@open-mpi.org> wrote:
>
> There is no linkage between slurm and ompi when it comes to hwloc. If you
> directly launch your app using srun, then slurm will use its version of hwloc
> to do the binding. If you use mpirun to launch the app, then we’ll use our
> internal version to do it.
>
> The two are completely isolated from each other.
>
>
>> On Dec 9, 2014, at 12:25 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>
>> The version that “lstopo --version” reports is the same (1.8) on all nodes,
>> but we may indeed be hitting the second issue. We can try to compile a new
>> version of openmpi, but how do we ensure that the external programs (e.g.
>> SLURM) are using the same hwloc version as the one embedded in openmpi? Is
>> it enough to just compile hwloc 1.9 separately as well and link against
>> that? Also, if this is an issue, should we file a bug against hwloc or
>> openmpi on Ubuntu for mismatching versions?
>>
>>> On 09 Dec 2014, at 00:50, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hmmm…they probably linked that to the external, system hwloc version, so it
>>> sounds like one or more of your nodes has a different hwloc rpm on it.
>>>
>>> I couldn’t leaf through your output well enough to see all the lstopo
>>> versions, but you might check to ensure they are the same.
>>>
>>> Looking at the code base, you may also hit a problem here. The OMPI 1.6
>>> series was based on hwloc 1.3 - the output you sent indicated you have
>>> hwloc 1.8, which is quite a big change. The OMPI 1.8 series is based on
>>> hwloc 1.9, so at least that is closer (though probably still a mismatch).
>>>
>>> Frankly, I’d just download and install an OMPI tarball myself and avoid
>>> these headaches. This mismatch in required versions is why we embed hwloc,
>>> as it is a critical library for OMPI, and we had to ensure that the version
>>> matched our internal requirements.
>>>
>>>
>>>> On Dec 8, 2014, at 8:50 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>
>>>> It is the default openmpi that comes with Ubuntu 14.04.
>>>>
>>>>> On 08 Dec 2014, at 17:17, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Pim: is this an OMPI you built, or one you were given somehow? If you
>>>>> built it, how did you configure it?
>>>>>
>>>>>> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>
>>>>>> It likely depends on how SLURM allocates the cpuset/cgroup inside the
>>>>>> nodes. The XML warning is related to these restrictions inside the node.
>>>>>> Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
>>>>>>
>>>>>> How do we check after install whether OMPI uses the embedded or the
>>>>>> system-wide hwloc?
>>>>>>
>>>>>> Brice
>>>>>>
>>>>>>
>>>>>> On 08/12/2014 17:07, Pim Schellart wrote:
>>>>>>> Dear Ralph,
>>>>>>>
>>>>>>> the nodes are called coma## and, as you can see in the logs, the nodes
>>>>>>> of the broken example are the same as the nodes of the working one, so
>>>>>>> that doesn’t seem to be the cause. Unless (very likely) I’m missing
>>>>>>> something. Anything else I can check?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Pim
>>>>>>>
>>>>>>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> As Brice said, OMPI has its own embedded version of hwloc that we use,
>>>>>>>> so there is no Slurm interaction to be considered. The most likely
>>>>>>>> cause is that one or more of your nodes is picking up a different
>>>>>>>> version of OMPI.
>>>>>>>> So things “work” if you happen to get nodes where all
>>>>>>>> the versions match, and “fail” when you get a combination that
>>>>>>>> includes a different version.
>>>>>>>>
>>>>>>>> Is there some way you can narrow down your search to find the node(s)
>>>>>>>> that are picking up the different version?
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Dear Brice,
>>>>>>>>>
>>>>>>>>> I am not sure why this is happening, since all code seems to be using
>>>>>>>>> the same hwloc library version (1.8), but it does :) An MPI program
>>>>>>>>> is started through SLURM on two nodes with four CPU cores total
>>>>>>>>> (divided over the nodes) using the following script:
>>>>>>>>>
>>>>>>>>> #! /bin/bash
>>>>>>>>> #SBATCH -N 2 -n 4
>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>>>>>>> /usr/bin/mpiexec /path/to/my_mpi_code
>>>>>>>>>
>>>>>>>>> When this is submitted multiple times it gives “out-of-order”
>>>>>>>>> warnings in about 9/10 cases but works without warnings in 1/10
>>>>>>>>> cases. I attached the output (with xml) for both the working and
>>>>>>>>> “broken” case. Note that the xml is of course printed (differently)
>>>>>>>>> multiple times for each task/core. As always, any help would be
>>>>>>>>> appreciated.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Pim Schellart
>>>>>>>>>
>>>>>>>>> P.S. $ mpirun --version
>>>>>>>>> mpirun (Open MPI) 1.6.5
>>>>>>>>>
>>>>>>>>> <broken.log><working.log>
>>>>>>>>>
>>>>>>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello
>>>>>>>>>> The github issue you're referring to was closed 18 months ago. The
>>>>>>>>>> warning (it's not an error) is only supposed to appear if you're
>>>>>>>>>> importing into a recent hwloc an XML that was exported from an old
>>>>>>>>>> hwloc. I don't see how that could happen when using Open MPI, since
>>>>>>>>>> the hwloc versions on both sides are the same.
>>>>>>>>>> Make sure you're not confusing it with another error described here:
>>>>>>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>>>>>>> Otherwise please report the exact Open MPI and/or hwloc versions as
>>>>>>>>>> well as the XML lstopo output on the nodes that raise the warning
>>>>>>>>>> (lstopo foo.xml). Send these to the hwloc mailing lists such as
>>>>>>>>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org
>>>>>>>>>> Thanks
>>>>>>>>>> Brice
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>>>>>>> Dear OpenMPI developers,
>>>>>>>>>>>
>>>>>>>>>>> this might be a bit off topic, but when using the SLURM scheduler
>>>>>>>>>>> (with cpuset support) on Ubuntu 14.04 (openmpi 1.6) hwloc sometimes
>>>>>>>>>>> gives an “out-of-order topology discovery” error. According to
>>>>>>>>>>> issue #103 on github (https://github.com/open-mpi/hwloc/issues/103)
>>>>>>>>>>> this error was discussed before, and it was possible to sort it out
>>>>>>>>>>> in “insert_object_by_parent”; is this still being considered? If
>>>>>>>>>>> not, what (top-level) hwloc API call should we look for in the
>>>>>>>>>>> SLURM sources to start debugging? Any help will be most welcome.
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>>
>>>>>>>>>>> Pim Schellart
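
P.S. Regarding Brice’s question above about how to check, after install, whether OMPI uses the embedded or the system-wide hwloc: this is the rough check we intend to run. It is only a sketch, not authoritative; the library path is a placeholder for wherever the Ubuntu package (or our own build) puts libmpi.so.

# rough check; /usr/lib/libmpi.so is a placeholder for the actual install path
ompi_info | grep -i hwloc           # the hwloc MCA component name should hint at embedded vs. external
ldd /usr/lib/libmpi.so | grep -i hwloc   # a libhwloc.so in the output suggests the system-wide copy is linked in
lstopo --version                    # compare what the hwloc tools report on every node

If this is the wrong way to tell the two apart, please let us know.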