Pim, if you configure OpenMPI with --with-hwloc=external (or something like --with-hwloc=/usr), it is very likely that OpenMPI will use the same hwloc library (i.e. the "system" library) that SLURM uses.
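For example (just a sketch; the source directory name and the idea of building from a tarball are assumptions, and the exact flags depend on your environment):

    # building Open MPI from a source tarball; pick ONE of the --with-hwloc variants
    cd openmpi-x.y.z                      # hypothetical unpacked source directory
    ./configure --with-hwloc=external     # link against the system hwloc found in the default search paths
    # ./configure --with-hwloc=/usr       # link against the hwloc installed under the given prefix
    # ./configure                         # default: use the hwloc copy embedded in the Open MPI tree
    make && make install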
/* I do not know how Ubuntu packages OpenMPI ... */ The default (i.e. no --with-hwloc parameter on the configure command line) is to use the hwloc library that is embedded within OpenMPI.

Gilles

On 2014/12/09 17:34, Pim Schellart wrote:
> Ah, OK, so that was where the confusion came from. I did see hwloc in the SLURM sources but couldn't immediately figure out where exactly it was used. We will try compiling openmpi with the embedded hwloc. Any particular flags I should set?
>
>> On 09 Dec 2014, at 09:30, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> There is no linkage between slurm and ompi when it comes to hwloc. If you directly launch your app using srun, then slurm will use its version of hwloc to do the binding. If you use mpirun to launch the app, then we'll use our internal version to do it.
>>
>> The two are completely isolated from each other.
>>
>>> On Dec 9, 2014, at 12:25 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>
>>> The version that "lstopo --version" reports is the same (1.8) on all nodes, but we may indeed be hitting the second issue. We can try to compile a new version of openmpi, but how do we ensure that the external programs (e.g. SLURM) are using the same hwloc version as the one embedded in openmpi? Is it enough to just compile hwloc 1.9 separately as well and link against that? Also, if this is an issue, should we file a bug against hwloc or openmpi on Ubuntu for mismatching versions?
>>>
>>>> On 09 Dec 2014, at 00:50, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hmmm... they probably linked that to the external, system hwloc version, so it sounds like one or more of your nodes has a different hwloc rpm on it.
>>>>
>>>> I couldn't leaf through your output well enough to see all the lstopo versions, but you might check to ensure they are the same.
>>>>
>>>> Looking at the code base, you may also hit a problem here. The OMPI 1.6 series was based on hwloc 1.3 - the output you sent indicated you have hwloc 1.8, which is quite a big change. The OMPI 1.8 series is based on hwloc 1.9, so at least that is closer (though probably still a mismatch).
>>>>
>>>> Frankly, I'd just download and install an OMPI tarball myself and avoid these headaches. This mismatch in required versions is why we embed hwloc: it is a critical library for OMPI, and we had to ensure that the version matched our internal requirements.
>>>>
>>>>> On Dec 8, 2014, at 8:50 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>
>>>>> It is the default openmpi that comes with Ubuntu 14.04.
>>>>>
>>>>>> On 08 Dec 2014, at 17:17, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Pim: is this an OMPI you built, or one you were given somehow? If you built it, how did you configure it?
>>>>>>
>>>>>>> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>
>>>>>>> It likely depends on how SLURM allocates the cpuset/cgroup inside the nodes. The XML warning is related to these restrictions inside the node. Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
>>>>>>>
>>>>>>> How do we check after install whether OMPI uses the embedded or the system-wide hwloc?
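One rough way to check, as a sketch only - the library path is just an example and depends on how the distribution lays out the install:

    # if the installed libmpi pulls in a libhwloc.so, the build links the external/system hwloc;
    # if nothing matches, the embedded copy was compiled in
    ldd /usr/lib/libmpi.so | grep -i hwloc

    # ompi_info also lists the hwloc component Open MPI was built with,
    # though the component naming differs between Open MPI versions
    ompi_info | grep -i hwloc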
>>>>>>>
>>>>>>> Brice
>>>>>>>
>>>>>>> On 08/12/2014 17:07, Pim Schellart wrote:
>>>>>>>> Dear Ralph,
>>>>>>>>
>>>>>>>> the nodes are called coma## and, as you can see in the logs, the nodes of the broken example are the same as the nodes of the working one, so that doesn't seem to be the cause. Unless (very likely) I'm missing something. Anything else I can check?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Pim
>>>>>>>>
>>>>>>>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> As Brice said, OMPI has its own embedded version of hwloc that we use, so there is no Slurm interaction to be considered. The most likely cause is that one or more of your nodes is picking up a different version of OMPI. So things "work" if you happen to get nodes where all the versions match, and "fail" when you get a combination that includes a different version.
>>>>>>>>>
>>>>>>>>> Is there some way you can narrow down your search to find the node(s) that are picking up the different version?
>>>>>>>>>
>>>>>>>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear Brice,
>>>>>>>>>>
>>>>>>>>>> I am not sure why this is happening, since all code seems to be using the same hwloc library version (1.8), but it does :) An MPI program is started through SLURM on two nodes with four CPU cores total (divided over the nodes) using the following script:
>>>>>>>>>>
>>>>>>>>>> #! /bin/bash
>>>>>>>>>> #SBATCH -N 2 -n 4
>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>>>>>>>> /usr/bin/mpiexec /path/to/my_mpi_code
>>>>>>>>>>
>>>>>>>>>> When this is submitted multiple times it gives "out-of-order" warnings in about 9/10 cases but works without warnings in 1/10 cases. I attached the output (with xml) for both the working and `broken` case. Note that the xml is of course printed (differently) multiple times for each task/core. As always, any help would be appreciated.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Pim Schellart
>>>>>>>>>>
>>>>>>>>>> P.S. $ mpirun --version
>>>>>>>>>> mpirun (Open MPI) 1.6.5
>>>>>>>>>>
>>>>>>>>>> <broken.log><working.log>
>>>>>>>>>>
>>>>>>>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>> The GitHub issue you're referring to was closed 18 months ago. The warning (it's not an error) is only supposed to appear when importing into a recent hwloc an XML export produced by an old hwloc. I don't see how that could happen when using Open MPI, since the hwloc versions on both sides are the same.
>>>>>>>>>>> Make sure you're not confusing it with another error, described here:
>>>>>>>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>>>>>>>> Otherwise please report the exact Open MPI and/or hwloc versions, as well as the XML lstopo output on the nodes that raise the warning (lstopo foo.xml).
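For example, a small batch script along these lines (node count and output paths are placeholders) collects the versions and the XML from every allocated node:

    #! /bin/bash
    #SBATCH -N 2
    mpirun --version                      # Open MPI version, as also reported above
    # one task per node is enough to capture each node's topology
    srun --ntasks-per-node=1 bash -c 'lstopo --version; lstopo /tmp/topo-$(hostname).xml'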
>>>>>>>>>>> Send these to the hwloc mailing lists, such as hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Brice
>>>>>>>>>>>
>>>>>>>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>>>>>>>> Dear OpenMPI developers,
>>>>>>>>>>>>
>>>>>>>>>>>> this might be a bit off topic, but when using the SLURM scheduler (with cpuset support) on Ubuntu 14.04 (openmpi 1.6) hwloc sometimes gives an "out-of-order topology discovery" error. According to issue #103 on github (https://github.com/open-mpi/hwloc/issues/103) this error was discussed before and it was possible to sort it out in "insert_object_by_parent"; is that fix still being considered? If not, what (top-level) hwloc API call should we look for in the SLURM sources to start debugging? Any help will be most welcome.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Pim Schellart
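As a starting point for that last question (a sketch only - the source directory name is a placeholder, and the functions listed are simply the usual top-level hwloc entry points, not a claim about where SLURM actually calls them):

    # list the SLURM source files that touch the top-level hwloc API
    grep -rl -e hwloc_topology_init -e hwloc_topology_load -e hwloc_topology_set_xml slurm-*/src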