[OMPI devel] Collective SM memory affinity possible error

2012-12-04 Thread Juan A. Rico
I am studying the collective SM code and I think there is a small error in how
the memory affinity is set. I am attaching a patch (against Subversion revision
27653) that I hope is useful:

Index: coll_sm_module.c
===
--- coll_sm_module.c    (revision 27653)
+++ coll_sm_module.c    (working copy)
@@ -434,7 +434,7 @@
 maffinity[j].mbs_len = c->sm_fragment_size;
 maffinity[j].mbs_start_addr = 
 data->mcb_data_index[i].mcbmi_data +
-(rank * c->sm_control_size);
+(rank * c->sm_fragment_size);
 ++j;
 #endif
 }
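
For context, each rank owns one data fragment of c->sm_fragment_size bytes per
data segment, while the control area uses slots of c->sm_control_size bytes per
rank, so the per-rank offset into the data area has to be scaled by the fragment
size. A minimal sketch of the intended computation (names taken from the hunk
above; the surrounding code is simplified):

    /* Sketch only: affinity entry for this rank's fragment inside data
     * segment i.  Each rank owns one fragment of sm_fragment_size bytes,
     * so the stride is the fragment size, not the control size. */
    maffinity[j].mbs_start_addr =
        data->mcb_data_index[i].mcbmi_data +
        (rank * c->sm_fragment_size);
    maffinity[j].mbs_len = c->sm_fragment_size;
    ++j;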


Regards,
Juan A. Rico




______
Juan Antonio Rico Gallego
Dept. Computer Systems Engineering and Telematics
University of Extremadura
E. U. Politécnica
10003, Cáceres
Tlf.: (+34) 927 25 7200 ext. 51655
jar...@unex.es
http://gim.unex.es/azequiampi

Re: [OMPI devel] SM component init unload

2012-07-04 Thread Juan A. Rico
Thank you all for your time and quick responses.

After applying the patch, SM can be used by raising its priority. That is enough
for me (I hope). However, it still fails when I specify --mca coll sm,self on the
command line (and with tuned as well).
I am not going to use this release in production, only to play with the code :-)

Regards,
Juan Antonio.

On 04/07/2012, at 02:59, George Bosilca wrote:

> Juan,
> 
> Something weird is going on there. The selection mechanism for the SM coll 
> and the SM BTL should be very similar. However, the SM BTL successfully selects 
> itself, while the SM coll fails to determine that all processes are local.
> 
> In the coll SM the issue is that the remote procs do not have the LOCAL flag 
> set, even when they are on the local node (although the ompi_proc_local() 
> return has a special flag stating that all processes in the job are local). I 
> compared the initialization of the SM BTL and the SM coll. It turns out that 
> somehow the procs returned by ompi_proc_all() and the procs provided to the 
> add_procs of the BTLs are not identical. The latter have the local flag 
> correctly set, so I went a little bit deeper.
> 
> Here is what I found while toying with gdb inside:
> 
> breakpoint 1, mca_coll_sm_init_query (enable_progress_threads=false, 
> enable_mpi_threads=false) at coll_sm_module.c:132
> 
> (gdb) p procs[0]
> $1 = (ompi_proc_t *) 0x109a1e8c0
> (gdb) p procs[1]
> $2 = (ompi_proc_t *) 0x109a1e970
> (gdb) p procs[0]->proc_flags
> $3 = 0
> (gdb) p procs[1]->proc_flags
> $4 = 4095
> 
> Breakpoint 2, mca_btl_sm_add_procs (btl=0x109baa1c0, nprocs=2, 
> procs=0x109a319e0, peers=0x109a319f0, reachability=0x7fff691378e8) at 
> btl_sm.c:427
> 
> (gdb) p procs[0]
> $5 = (struct ompi_proc_t *) 0x109a1e8c0
> (gdb) p procs[1]
> $6 = (struct ompi_proc_t *) 0x109a1e970
> (gdb) p procs[0]->proc_flags
> $7 = 1920
> (gdb) p procs[1]->proc_flags
> $8 = 4095
> 
> Thus the problem seems to come from the fact that during the initialization 
> of the SM coll the flags are not correctly set. However, this is somewhat 
> expected … as the call to the initialization happens before the exchange of 
> the business cards (and therefore there is no way to have any knowledge about 
> the remote procs).
> 
> So, either something changed drastically in the way we set the flags for 
> remote processes, or we have not actually used the SM coll for the last 3 
> years. I think the culprit is r21967 
> (https://svn.open-mpi.org/trac/ompi/changeset/21967), which added a "selection" 
> logic based on knowledge about remote procs to the coll SM initialization 
> function. But that selection logic runs way too early!
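> 
> For reference, the shape of that early check (which the patch below disables) 
> is roughly the following; this is a simplified sketch from memory, and the 
> exact locality-flag macro differs between versions of the tree:
> 
>     /* Sketch of the early locality test in mca_coll_sm_init_query().
>      * Because init_query runs before the modex, remote procs carry no
>      * locality information yet, so this loop never finds a local peer
>      * and the component disqualifies itself. */
>     static int sm_has_local_peer_sketch(void)
>     {
>         size_t i, size;
>         ompi_proc_t *my_proc = ompi_proc_local();
>         ompi_proc_t **procs  = ompi_proc_all(&size);
> 
>         for (i = 0; i < size; ++i) {
>             if (procs[i] != my_proc &&
>                 (procs[i]->proc_flags & OMPI_PROC_FLAG_LOCAL)) {
>                 return 1;   /* found another local proc */
>             }
>         }
>         return 0;           /* "no other local procs; disqualifying myself" */
>     }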
> 
> I would strongly encourage you not to use this SM collective component in 
> anything related to production runs.
> 
>   george.
> 
> PS: However, if you want to toy with the SM coll, apply the following patch:
> Index: coll_sm_module.c
> ===
> --- coll_sm_module.c  (revision 26737)
> +++ coll_sm_module.c  (working copy)
> @@ -128,6 +128,7 @@
>  int mca_coll_sm_init_query(bool enable_progress_threads,
> bool enable_mpi_threads)
>  {
> +#if 0
>  ompi_proc_t *my_proc, **procs;
>  size_t i, size;
>  
> @@ -158,7 +159,7 @@
>  "coll:sm:init_query: no other local procs; 
> disqualifying myself");
>  return OMPI_ERR_NOT_AVAILABLE;
>  }
> -
> +#endif
>  /* Don't do much here because we don't really want to allocate any
> shared memory until this component is selected to be used. */
>  opal_output_verbose(10, mca_coll_base_output,
> 
> 
> 
> 
> 
> On Jul 4, 2012, at 02:05 , Ralph Castain wrote:
> 
>> Okay, please try this again with r26739 or above. You can remove the rest of 
>> the "verbose" settings and the --display-map so we declutter the output. 
>> Please add "-mca orte_nidmap_verbose 20" to your cmd line.
>> 
>> Thanks!
>> Ralph
>> 
>> 
>> On Tue, Jul 3, 2012 at 1:50 PM, Juan A. Rico <jar...@unex.es> wrote:
>> Here is the output.
>> 
>> [jarico@Metropolis-01 examples]$ 
>> /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec --bind-to-core 
>> --bynode --mca mca_base_verbose 100 --mca mca_coll_base_output 100  --mca 
>> coll_sm_priority 99 -mca hwloc_base_verbose 90 --display-map --mca 
>> mca_verbose 100 --mca mca_base_verbose 100 --mca coll_base_verbose 100 -n 2 
>> -mca grpcomm_base_verbose 5 ./bmem
>> [Metropolis-01:24563] mca: base: components_open: Looking for hwloc 
>> components
>> [Metropolis-0

Re: [OMPI devel] SM component init unload

2012-07-03 Thread Juan A. Rico
],0] grpcomm:base:receive stop comm
[Metropolis-01:24563] [[36265,0],0] grpcomm:bad:xcast sent to job [36265,0] tag 
1
[Metropolis-01:24563] [[36265,0],0] grpcomm:xcast:recv:send_relay
[Metropolis-01:24563] [[36265,0],0] orte:daemon:send_relay - recipient list is 
empty!
[jarico@Metropolis-01 examples]$ 



On 03/07/2012, at 21:44, Ralph Castain wrote:

> Interesting - yes, coll sm doesn't think they are on the same node for some 
> reason. Try adding -mca grpcomm_base_verbose 5 and let's see why.
> 
> 
> On Jul 3, 2012, at 1:24 PM, Juan Antonio Rico Gallego wrote:
> 
>> The code I run is a simple broadcast. 
>> 
>> When I do not specify components to run, the output is (more verbose):
>> 
>> [jarico@Metropolis-01 examples]$ 
>> /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec --mca 
>> mca_base_verbose 100 --mca mca_coll_base_output 100  --mca coll_sm_priority 
>> 99 -mca hwloc_base_verbose 90 --display-map --mca mca_verbose 100 --mca 
>> mca_base_verbose 100 --mca coll_base_verbose 100 -n 2 ./bmem
>> [Metropolis-01:24490] mca: base: components_open: Looking for hwloc 
>> components
>> [Metropolis-01:24490] mca: base: components_open: opening hwloc components
>> [Metropolis-01:24490] mca: base: components_open: found loaded component 
>> hwloc142
>> [Metropolis-01:24490] mca: base: components_open: component hwloc142 has no 
>> register function
>> [Metropolis-01:24490] mca: base: components_open: component hwloc142 has no 
>> open function
>> [Metropolis-01:24490] hwloc:base:get_topology
>> [Metropolis-01:24490] hwloc:base: no cpus specified - using root available 
>> cpuset
>> 
>>    JOB MAP   
>> 
>> Data for node: Metropolis-01 Num procs: 2
>>  Process OMPI jobid: [36336,1] App: 0 Process rank: 0
>>  Process OMPI jobid: [36336,1] App: 0 Process rank: 1
>> 
>> =
>> [Metropolis-01:24491] mca: base: components_open: Looking for hwloc 
>> components
>> [Metropolis-01:24491] mca: base: components_open: opening hwloc components
>> [Metropolis-01:24491] mca: base: components_open: found loaded component 
>> hwloc142
>> [Metropolis-01:24491] mca: base: components_open: component hwloc142 has no 
>> register function
>> [Metropolis-01:24491] mca: base: components_open: component hwloc142 has no 
>> open function
>> [Metropolis-01:24492] mca: base: components_open: Looking for hwloc 
>> components
>> [Metropolis-01:24492] mca: base: components_open: opening hwloc components
>> [Metropolis-01:24492] mca: base: components_open: found loaded component 
>> hwloc142
>> [Metropolis-01:24492] mca: base: components_open: component hwloc142 has no 
>> register function
>> [Metropolis-01:24492] mca: base: components_open: component hwloc142 has no 
>> open function
>> [Metropolis-01:24491] locality: CL:CU:N:B
>> [Metropolis-01:24491] hwloc:base: get available cpus
>> [Metropolis-01:24491] hwloc:base:get_available_cpus first time - filtering 
>> cpus
>> [Metropolis-01:24491] hwloc:base: no cpus specified - using root available 
>> cpuset
>> [Metropolis-01:24491] hwloc:base:get_available_cpus root object
>> [Metropolis-01:24491] mca: base: components_open: Looking for coll components
>> [Metropolis-01:24491] mca: base: components_open: opening coll components
>> [Metropolis-01:24491] mca: base: components_open: found loaded component 
>> tuned
>> [Metropolis-01:24491] mca: base: components_open: component tuned has no 
>> register function
>> [Metropolis-01:24491] coll:tuned:component_open: done!
>> [Metropolis-01:24491] mca: base: components_open: component tuned open 
>> function successful
>> [Metropolis-01:24491] mca: base: components_open: found loaded component sm
>> [Metropolis-01:24491] mca: base: components_open: component sm register 
>> function successful
>> [Metropolis-01:24491] mca: base: components_open: component sm has no open 
>> function
>> [Metropolis-01:24491] mca: base: components_open: found loaded component 
>> libnbc
>> [Metropolis-01:24491] mca: base: components_open: component libnbc register 
>> function successful
>> [Metropolis-01:24491] mca: base: components_open: component libnbc open 
>> function successful
>> [Metropolis-01:24491] mca: base: components_open: found loaded component 
>> hierarch
>> [Metropolis-01:24491] mca: base: components_open: component hierarch has no 
>> register function
>> [Metropolis-01:24491] mca: base: components_open: component hierarch open 
>> fu

Re: [OMPI devel] SM component init unload

2012-07-03 Thread Juan Antonio Rico Gallego
 not available: self
[Metropolis-01:24491] coll:tuned:module_init called.
[Metropolis-01:24491] coll:tuned:module_init Tuned is in use
[Metropolis-01:24491] coll:base:comm_select: new communicator: MPI_COMM_SELF 
(cid 1)
[Metropolis-01:24491] coll:base:comm_select: Checking all available modules
[Metropolis-01:24491] coll:tuned:module_tuned query called
[Metropolis-01:24491] coll:base:comm_select: component not available: tuned
[Metropolis-01:24491] coll:base:comm_select: component available: libnbc, 
priority: 10
[Metropolis-01:24491] coll:base:comm_select: component not available: hierarch
[Metropolis-01:24491] coll:base:comm_select: component available: basic, 
priority: 10
[Metropolis-01:24491] coll:base:comm_select: component not available: inter
[Metropolis-01:24491] coll:base:comm_select: component available: self, 
priority: 75
[Metropolis-01:24492] coll:base:comm_select: new communicator: MPI_COMM_WORLD 
(cid 0)
[Metropolis-01:24492] coll:base:comm_select: Checking all available modules
[Metropolis-01:24492] coll:tuned:module_tuned query called
[Metropolis-01:24492] coll:base:comm_select: component available: tuned, 
priority: 30
[Metropolis-01:24492] coll:base:comm_select: component available: libnbc, 
priority: 10
[Metropolis-01:24492] coll:base:comm_select: component not available: hierarch
[Metropolis-01:24492] coll:base:comm_select: component available: basic, 
priority: 10
[Metropolis-01:24492] coll:base:comm_select: component not available: inter
[Metropolis-01:24492] coll:base:comm_select: component not available: self
[Metropolis-01:24492] coll:tuned:module_init called.
[Metropolis-01:24492] coll:tuned:module_init Tuned is in use
[Metropolis-01:24492] coll:base:comm_select: new communicator: MPI_COMM_SELF 
(cid 1)
[Metropolis-01:24492] coll:base:comm_select: Checking all available modules
[Metropolis-01:24492] coll:tuned:module_tuned query called
[Metropolis-01:24492] coll:base:comm_select: component not available: tuned
[Metropolis-01:24492] coll:base:comm_select: component available: libnbc, 
priority: 10
[Metropolis-01:24492] coll:base:comm_select: component not available: hierarch
[Metropolis-01:24492] coll:base:comm_select: component available: basic, 
priority: 10
[Metropolis-01:24492] coll:base:comm_select: component not available: inter
[Metropolis-01:24492] coll:base:comm_select: component available: self, 
priority: 75
[Metropolis-01:24491] coll:tuned:component_close: called
[Metropolis-01:24491] coll:tuned:component_close: done!
[Metropolis-01:24492] coll:tuned:component_close: called
[Metropolis-01:24492] coll:tuned:component_close: done!
[Metropolis-01:24492] mca: base: close: component tuned closed
[Metropolis-01:24492] mca: base: close: unloading component tuned
[Metropolis-01:24492] mca: base: close: component libnbc closed
[Metropolis-01:24492] mca: base: close: unloading component libnbc
[Metropolis-01:24492] mca: base: close: unloading component hierarch
[Metropolis-01:24492] mca: base: close: unloading component basic
[Metropolis-01:24492] mca: base: close: unloading component inter
[Metropolis-01:24492] mca: base: close: unloading component self
[Metropolis-01:24491] mca: base: close: component tuned closed
[Metropolis-01:24491] mca: base: close: unloading component tuned
[Metropolis-01:24491] mca: base: close: component libnbc closed
[Metropolis-01:24491] mca: base: close: unloading component libnbc
[Metropolis-01:24491] mca: base: close: unloading component hierarch
[Metropolis-01:24491] mca: base: close: unloading component basic
[Metropolis-01:24491] mca: base: close: unloading component inter
[Metropolis-01:24491] mca: base: close: unloading component self
[jarico@Metropolis-01 examples]$ 


SM is not loaded because it detects no other processes on the same machine:

[Metropolis-01:24491] coll:sm:init_query: no other local procs; disqualifying 
myself

The machine is a multicore node with 8 cores.

I need to run the SM component code, and I assume that, once this problem is
solved, raising its priority will make it the selected component.



On 03/07/2012, at 21:01, Jeff Squyres wrote:

> The issue is that the "sm" coll component only implements a few of the MPI 
> collective operations.  It is usually mixed at run-time with other coll 
> components to fill out the rest of the MPI collective operations.
> 
> So what is happening is that OMPI is determining that it doesn't have 
> implementations of all the MPI collective operations and aborting.
> 
> You shouldn't need to manually select your coll module -- OMPI should 
> automatically select the right collective module for you.  E.g., if all procs 
> are local on a single machine and sm has a matching implementation for that 
> MPI collective operation, it'll be used.
> 
> 
> 
> On Jul 3, 2012, at 2:48 PM, Juan Antonio Rico Gallego wrote:
> 
>> Output is:
>> 
>> [Metropolis-01:15355] hwloc:base:get_topology
>> [Metropolis-01:15355] hwloc:base: no cpus

[OMPI devel] SM component init unload

2012-07-03 Thread Juan Antonio Rico Gallego
  /* get the locality information */
  proc->proc_flags = orte_ess.proc_get_locality(&proc->proc_name);
  /* get the name of the node it is on */
  proc->proc_hostname = orte_ess.proc_get_hostname(&proc->proc_name);
}


enough for it to run correctly. But this function has changed and this code no
longer works. I am not sure now what I am doing wrong.

Thanks for your time,
Juan A. Rico