devel-boun...@open-mpi.org wrote on 02/09/2012 12:20:41 PM:

> From: Brice Goglin <brice.gog...@inria.fr>
> To: Open MPI Developers <de...@open-mpi.org>
> Date: 02/09/2012 12:20 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
>
> By default, hwloc only shows what's inside the current cpuset. There's
> an option to show everything instead (topology flag).

So maybe using that flag inside opal_paffinity_base_get_processor_info()
would be a better fix than the one I'm proposing in my patch.

I found a bunch of other places where things are handled the same way as
in get_ib_dev_distance(). Just doing a grep in the sources, I could find:
  . init_maffinity() in btl/sm/btl_sm.c
  . vader_init_maffinity() in btl/vader/btl_vader.c
  . get_ib_dev_distance() in btl/wv/btl_wv_component.c

So I think the flag Brice is talking about should definitely be the fix.

Regards,
Nadia

>
> Brice
>
>
> On 09/02/2012 12:18, Jeff Squyres wrote:
> > Just so that I understand this better -- if a process is bound in
> > a cpuset, will tools like hwloc's lstopo only show the Linux
> > processors *in that cpuset*? I.e., does it not have any visibility
> > of the processors outside of its cpuset?
> >
> >
> > On Jan 27, 2012, at 11:38 AM, nadia.derbey wrote:
> >
> >> Hi,
> >>
> >> If a job is launched using "srun --resv-ports --cpu_bind:..." and slurm
> >> is configured with:
> >>    TaskPlugin=task/affinity
> >>    TaskPluginParam=Cpusets
> >>
> >> each rank of that job is in a cpuset that contains a single CPU.
> >>
> >> Now, if we use carto on top of this, the following happens in
> >> get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
> >>    . opal_paffinity_base_get_processor_info() is called to get the
> >>      number of logical processors (we get 1 due to the singleton cpuset)
> >>    . we loop over that number of processors to check whether our
> >>      process is bound to one of them. In our case the loop is executed
> >>      only once, so we never get the correct binding information.
> >>    . if the process is bound, we actually get the distance to the
> >>      device. In our case that part of the code is never executed.
> >>
> >> The attached patch is a proposal to fix the issue.
> >>
> >> Regards,
> >> Nadia
> >> <get_ib_dev_distance.patch>
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel