Dear Brice,

I am not sure why this is happening, since all the code seems to be using the same hwloc library version (1.8), but it does :) An MPI program is started through SLURM on two nodes with four CPU cores in total (divided over the nodes), using the following script:
#! /bin/bash
#SBATCH -N 2 -n 4
/usr/bin/mpiexec /usr/bin/lstopo --version
/usr/bin/mpiexec /usr/bin/lstopo --of xml
/usr/bin/mpiexec /path/to/my_mpi_code

When this is submitted multiple times it gives "out-of-order" warnings in about 9 out of 10 cases, but works without warnings in the remaining cases. I attached the output (including the XML) for both the working and the "broken" case. Note that the XML is of course printed multiple times, once for each task/core, and differs between them. As always, any help would be appreciated.

Regards,

Pim Schellart

P.S.

$ mpirun --version
mpirun (Open MPI) 1.6.5
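P.P.S. In case it helps to narrow this down, one check that could be run directly on a compute node (the file path below is just an example) is to re-import lstopo's own XML export and see whether hwloc already complains there:

    /usr/bin/lstopo --of xml /tmp/topo.xml    # export the node topology to XML
    /usr/bin/lstopo --input /tmp/topo.xml     # re-import that same XML

If the out-of-order warning also shows up in the second step, then it is the node's own XML that hwloc considers out of order, independent of Open MPI or SLURM.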
broken.log
working.log
> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>
> Hello
> The github issue you're referring to was closed 18 months ago. The
> warning (it's not an error) is only supposed to appear if you're
> importing into a recent hwloc an XML that was exported from an old hwloc. I
> don't see how that could happen when using Open MPI, since the hwloc
> versions on both sides are the same.
> Make sure you're not confusing it with another error described here:
>
> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>
> Otherwise please report the exact Open MPI and/or hwloc versions as well
> as the XML lstopo output on the nodes that raise the warning (lstopo
> foo.xml). Send these to hwloc mailing lists such as
> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org
> Thanks
> Brice
>
>
> On 07/12/2014 13:29, Pim Schellart wrote:
>> Dear Open MPI developers,
>>
>> this might be a bit off topic, but when using the SLURM scheduler (with
>> cpuset support) on Ubuntu 14.04 (openmpi 1.6) hwloc sometimes gives an
>> "out-of-order topology discovery" error. According to issue #103 on github
>> (https://github.com/open-mpi/hwloc/issues/103) this error was discussed
>> before and it was possible to sort it out in "insert_object_by_parent"; is
>> this still being considered? If not, what (top-level) hwloc API call should we
>> look for in the SLURM sources to start debugging? Any help will be most
>> welcome.
>>
>> Kind regards,
>>
>> Pim Schellart
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16441.php
>
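Regarding my earlier question about which hwloc API call to look for in the SLURM sources: one straightforward way to start seems to be grepping the source trees for the XML-related entry points listed in the hwloc documentation, for example (directory names below are just how the trees happen to be unpacked here):

    # look for places where a topology is loaded from XML rather than discovered
    grep -rn "hwloc_topology_set_xml" slurm/src/
    grep -rn "hwloc_topology_load" slurm/src/
    # and similarly in the Open MPI source tree

If one of those turns up an XML import path, that would be the place to check which hwloc version produced the XML being fed in.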