Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-22 Thread Jeff Squyres
On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:

> In charge of ticket 1888 (see at 
> https://svn.open-mpi.org/trac/ompi/ticket/1888) ,
> I have put the resulting code in bitbucket at:
> http://bitbucket.org/devezep/new-romio-for-openmpi/

Sweet!

> The work in this repo consisted of refreshing ROMIO to a newer
> version: the one from the very last MPICH2 release (mpich2-1.3b1).

Great!  I saw there was another MPICH2 release, and I saw a ROMIO patch or 
three go by on the MPICH list recently.  Do you expect there to be major 
differences between what you have and those changes?

I don't have any parallel filesystems to test with, but if someone else in the 
community could confirm/verify at least one or two of the parallel filesystems 
supported in ROMIO, I think we should bring this stuff into the trunk soon.

> Testing:
> 1. runs fine except one minor error (see the explanation below) on various FS.
> 2. runs fine with Lustre, but:
>    . had to add a small patch in romio/adio/ad_lustre_open.c

Did this patch get pushed upstream?

> === The minor error ===
> The test error.c fails because OpenMPI does not handle correctly the
> "two level" error functions of ROMIO:
>   error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
>   myname, __LINE__, MPI_ERR_ARG,
>   "**iobaddisp", 0);
> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".

Do you mean that we should be returning an error string "**iobaddisp" instead 
of "MPI_ERR_ARG"?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Jeff Squyres
Sorry for the delay in replying -- I was in Europe for the past two weeks; 
travel always makes me waaay behind on my INBOX...


On Sep 14, 2010, at 9:56 PM, 张晶 wrote:

> I tried to add a scheduling algorithm to the pml component (ob1, etc.).  
> Unfortunately, I could only find a paper named "Open MPI: A Flexible High 
> Performance MPI" and some comments in the source files.  From them, I know 
> ob1 implements round-robin and weighted distribution algorithms, but after 
> tracing MPI_Send() I can't figure out where these are implemented, let alone 
> how to add a new scheduling algorithm. 
> I have two questions:
> 1. Where is the scheduling algorithm located?

It's complicated -- I'd say that the PML is probably among the most complicated 
sections of Open MPI because it is the main "engine" that enforces the MPI 
point-to-point semantics.  The algorithm is fairly well distributed throughout 
the PML source code.  :-\

> 2. There are five components in the pml framework: cm, crcpw, csum, ob1, and v.  
> What is the function of each of these components?

cm: this component drives the MTL point-to-point components.  It is mainly a 
thin wrapper for network transports that provide their own MPI-like matching 
semantics.  Hence, most of the MPI semantics are effectively done in the lower 
layer (i.e., in the MTL components and their dependent libraries).  You 
probably won't be able to do much here, because such transports (MX, Portals, 
etc.) do most of their semantics in the network layer -- not in Open MPI.  If 
you have a matching network layer, this is the PML that you probably use (MX, 
Portals, PSM).

crcpw: this is a fork of the ob1 PML; it adds some failover semantics.

csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so 
you can tell if the underlying transport had an error).

v: this PML uses logging and replay to effect some level of fault tolerance.  
It's a distant fork of the ob1 PML, but has quite a few significant differences.

ob1: this is the "main" PML that most users use (TCP, shared memory, 
OpenFabrics, etc.).  It gangs together one or more BTLs to send/receive 
messages across individual network transports.  Hence, it supports true 
multi-device/multi-rail algorithms.  The BML (BTL multiplexing layer) is a thin 
management layer that marshals all the BTLs in the process together -- it's 
mainly array handling, etc.  The ob1 PML is the one that decides 
multi-rail/device splitting, etc.  The INRIA folks just published a paper last 
week at Euro MPI about adjusting the ob1 scheduling algorithm to also take 
NUMA/NUNA/NUIOA effects into account, not just raw bandwidth calculations.
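
To make the splitting idea concrete, here is a toy, standalone sketch (NOT the 
actual ob1 code; the rail names and weights are invented) of how a long message 
could be divided across rails in proportion to their advertised bandwidth:

/* Toy illustration of bandwidth-weighted multi-rail splitting.  This is
 * not the real ob1 scheduling code; it only shows the proportional-split
 * arithmetic that such a scheduler performs before any NUMA/NUIOA
 * adjustments. */
#include <stdio.h>
#include <stddef.h>

struct rail {
    const char *name;       /* hypothetical BTL/rail name */
    double      bandwidth;  /* relative bandwidth weight */
};

int main(void)
{
    struct rail rails[] = { { "openib0", 10.0 },
                            { "openib1", 10.0 },
                            { "tcp0",     1.0 } };
    const size_t nrails  = sizeof(rails) / sizeof(rails[0]);
    const size_t msg_len = 1 << 20;   /* a 1 MiB message to split */

    double total = 0.0;
    for (size_t i = 0; i < nrails; ++i) {
        total += rails[i].bandwidth;
    }

    size_t assigned = 0;
    for (size_t i = 0; i < nrails; ++i) {
        /* The last rail takes the remainder so the chunks sum to msg_len. */
        size_t chunk = (i == nrails - 1)
            ? msg_len - assigned
            : (size_t)(msg_len * (rails[i].bandwidth / total));
        assigned += chunk;
        printf("%-8s gets %zu bytes\n", rails[i].name, chunk);
    }
    return 0;
}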

Hope this helps!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Jeff Squyres
On Sep 22, 2010, at 8:00 AM, Jeff Squyres wrote:

> crcpw: this is a fork of the ob1 PML; it adds some failover semantics.

Oops!  I messed this up:

bfo is the one I meant to write up there -- it's a fork of ob1; it adds 
failover semantics.

I don't know exactly what crcpw is -- I suspect this is a Josh creation for 
some kind of fault tolerance...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-22 Thread Pascal Deveze

Jeff Squyres wrote:

> On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:
> 
>> In charge of ticket 1888 (see at 
>> https://svn.open-mpi.org/trac/ompi/ticket/1888),
>> I have put the resulting code in bitbucket at:
>> http://bitbucket.org/devezep/new-romio-for-openmpi/
> 
> Sweet!
> 
>> The work in this repo consisted of refreshing ROMIO to a newer
>> version: the one from the very last MPICH2 release (mpich2-1.3b1).
> 
> Great!  I saw there was another MPICH2 release, and I saw a ROMIO patch or 
> three go by on the MPICH list recently.  Do you expect there to be major 
> differences between what you have and those changes?

I also saw this new release (mpich2-1.3rc1). I am going to port the 
modifications and inform the list.

> I don't have any parallel filesystems to test with, but if someone else in the 
> community could confirm/verify at least one or two of the parallel filesystems 
> supported in ROMIO, I think we should bring this stuff into the trunk soon.
> 
>> Testing:
>> 1. runs fine except one minor error (see the explanation below) on various FS.
>> 2. runs fine with Lustre, but:
>>    . had to add a small patch in romio/adio/ad_lustre_open.c
> 
> Did this patch get pushed upstream?

This patch was integrated yesterday into mpich2-1.3rc1, together with another 
patch in romio/adio/common/lock.c. They will be available very soon in 
bitbucket.

>> === The minor error ===
>> The test error.c fails because OpenMPI does not handle correctly the
>> "two level" error functions of ROMIO:
>>   error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
>>                                     myname, __LINE__, MPI_ERR_ARG,
>>                                     "**iobaddisp", 0);
>> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".
> 
> Do you mean that we should be returning an error string "**iobaddisp" instead of 
> "MPI_ERR_ARG"?

In MPICH2, they have a file mpi/errhan/errnames.txt that is used to generate 
mpi/errhan/errnames.h, which maps codes like "**iobaddisp" to the corresponding 
error string "Invalid displacement argument".
The error.c program tests for the presence of "displacement" in the error 
string.

With OpenMPI, the error message is:
"MPI_ERR_ARG: invalid argument of some other kind"

With MPICH2, the error message is:
"Invalid argument, error stack:
MPI_FILE_SET_VIEW(60): Invalid displacement argument"

It would be better if OpenMPI displayed at least the "Invalid 
displacement argument" message.

This is not a new problem in OpenMPI; it was also the case in the trunk.
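
For illustration only, here is a minimal sketch of the kind of lookup that such 
a generated table enables (everything except the "**iobaddisp" entry and its 
string is invented, and this is not the MPICH2 implementation):

/* Illustrative sketch only: map a ROMIO "**..." instance code to a
 * human-readable string, falling back to the generic message for the MPI
 * error class.  Only the "**iobaddisp" entry below matches a real code. */
#include <stdio.h>
#include <string.h>

struct err_name {
    const char *code;
    const char *text;
};

static const struct err_name err_names[] = {
    { "**iobaddisp", "Invalid displacement argument" },
    { NULL, NULL }
};

static const char *lookup_io_error(const char *instance_code,
                                   const char *class_default)
{
    for (int i = 0; NULL != err_names[i].code; ++i) {
        if (0 == strcmp(err_names[i].code, instance_code)) {
            return err_names[i].text;
        }
    }
    /* Nothing more specific known: fall back to the class string,
     * which is roughly what OpenMPI does today. */
    return class_default;
}

int main(void)
{
    printf("%s\n",
           lookup_io_error("**iobaddisp",
                           "MPI_ERR_ARG: invalid argument of some other kind"));
    return 0;
}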



Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Joshua Hursey
crcpw is a wrapper around the PML to support coordinated checkpoint restart. It 
mostly just replays the call to the 'crcp' framework that tracks the signature 
of messages traveling through the system.

If you are not using the C/R feature, then I would not worry about the crcpw 
PML component (it is disabled automatically in non-CR builds).

-- Josh

On Sep 22, 2010, at 8:44 AM, Jeff Squyres wrote:

> On Sep 22, 2010, at 8:00 AM, Jeff Squyres wrote:
> 
>> crcpw: this is a fork of the ob1 PML; it adds some failover semantics.
> 
> Oops!  I messed this up:
> 
> bfo is the one I meant to write up there -- it's a fork of ob1; it adds 
> failover semantics.
> 
> I don't know exactly what crcpw is -- I suspect this is a Josh creation for 
> some kind of fault tolerance...?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey




Re: [OMPI devel] Barrier() after Finalize() when a file handle is leaked.

2010-09-22 Thread Jeff Squyres
Thanks Lisandro!

I filed https://svn.open-mpi.org/trac/ompi/ticket/2594 about this.


On Sep 15, 2010, at 11:28 AM, Lisandro Dalcin wrote:

> I've tested this with (--enable-debug --enable-picky
> --enable-mem-debug) 1.4.2 and 1.5rc6. Despite being debug builds, a
> mpi4py user got the same with (likely release) builds in both Ubuntu
> and OS X.
> 
> $ cat open.c
> #include <mpi.h>
> int main(int argc, char *argv[]) {
>  MPI_File f;
>  MPI_Init(&argc, &argv);
>  MPI_File_open(MPI_COMM_WORLD, "test.plt", MPI_MODE_RDONLY, MPI_INFO_NULL, 
> &f);
>  /* MPI_File_close(&f); */
>  MPI_Finalize();
>  return 0;
> }
> 
> $ mpicc open.c
> 
> $ ./a.out
> *** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [trantor:15145] Abort after MPI_FINALIZE completed successfully; not
> able to guarantee that all other processes were killed!
> 
> 
> So if you open a file but never close it, an MPI_Barrier() gets called
> after MPI_Finalize(). Could that come from a ROMIO finalizer callback?
> However, I do not get this failure with MPICH2, and Open MPI seems to
> behave just fine regarding MPI_Finalized(); the code below works as
> expected:
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> static int atexitmpi(MPI_Comm comm, int k, void *v, void *xs) {
>  int flag;
>  MPI_Finalized(&flag);
>  printf("atexitmpi: finalized=%d\n", flag);
>  MPI_Barrier(MPI_COMM_WORLD);
>  return MPI_SUCCESS;  /* attribute delete callbacks must return an int */
> }
> 
> int main(int argc, char *argv[]) {
>  int keyval = MPI_KEYVAL_INVALID;
>  MPI_Init(&argc, &argv);
>  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, atexitmpi, &keyval, 0);
>  MPI_Comm_set_attr(MPI_COMM_SELF, keyval, 0);
>  MPI_Finalize();
>  return 0;
> }
> 
> 
> 
> -- 
> Lisandro Dalcin
> ---
> CIMEC (INTEC/CONICET-UNL)
> Predio CONICET-Santa Fe
> Colectora RN 168 Km 472, Paraje El Pozo
> Tel: +54-342-4511594 (ext 1011)
> Tel/Fax: +54-342-4511169
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] Setting AUTOMAKE_JOBS

2010-09-22 Thread Jeff Squyres
Some of you may be unaware that recent versions of automake can run in 
parallel.  That is, automake will run in parallel with a degree of (at most) 
$AUTOMAKE_JOBS.  This can speed up the execution time of autogen.pl quite a bit 
on some platforms.  On my cluster at cisco, here's a few quick timings of the 
entire autogen.pl process (of which, automake is the bottleneck):

$AUTOMAKE_JOBS   Total wall time
     value        of autogen.pl
       8             3:01.46
       4             2:55.57
       2             3:28.09
       1             4:38.44

This is an older Xeon machine with 2 sockets, each with 2 cores.

There's a nice performance jump from 1 to 2, and a smaller jump from 2 to 4.  4 
and 8 are close enough to not matter.  YMMV.

I just committed a heuristic to autogen.pl to setenv AUTOMAKE_JOBS if it is not 
already set (https://svn.open-mpi.org/trac/ompi/changeset/23788):

- If lstopo is found in your $PATH, autogen.pl runs it and counts how many PUs 
(processing units) you have.  It'll set AUTOMAKE_JOBS to that number, up to a 
maximum of 4 (which is admittedly a further heuristic).  
- If lstopo is not found, it just sets AUTOMAKE_JOBS to 2.

Enjoy.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-22 Thread Pascal Deveze
I just committed the very latest modifications of ROMIO (mpich2-1.3rc1) 
into bitbucket.


Pascal





Re: [OMPI devel] Question regarding recently common shared-memory component

2010-09-22 Thread Jeff Squyres
On Sep 21, 2010, at 12:37 PM,   
wrote:

> Like I said in my earlier response, I have never tried this option. So I ran 
> these tests on 1.4.2 now and apparently the behavior is the same, i.e., the 
> checkpoint creation time increases when I enable the shared memory component.

I don't have huge experience with the checkpoint/restart stuff, but this is 
probably not a surprising result because the checkpoint will now need to 
include the shared memory stuff. Are the checkpoint images larger?  (at least: 
is one of them noticeably larger?)  That might account for the checkpoint 
performance difference.

> Is there any parameter that can be tuned to improve the performance?

My understanding is that there are some inherent bottlenecks in checkpoint / 
restart, such as the time required to dump out all the process images to disk.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Kenneth Lloyd
Jeff,

Is that EuroMPI2010 ob1 paper publicly available? I get involved in various 
NUMA partitioning/architecting studies and it seems there is not a lot of 
discussion in this area.

Ken Lloyd

==
Kenneth A. Lloyd
Watt Systems Technologies Inc.






Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Jeff Squyres
I see it here:

http://hal.archives-ouvertes.fr/inria-00486178/en/



On Sep 22, 2010, at 11:53 AM, Kenneth Lloyd wrote:

> Jeff,
> 
> Is that EuroMPI2010 ob1 paper publicly available? I get involved in various 
> NUMA partitioning/architecting studies and it seems there is not a lot of 
> discussion in this area.
> 
> Ken Lloyd
> 
> ==
> Kenneth A. Lloyd
> Watt Systems Technologies Inc.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Kenneth Lloyd
Thank you very much.

Ken





Re: [OMPI devel] Setting AUTOMAKE_JOBS

2010-09-22 Thread Ralf Wildenhues
Hi Jeff,

adding bug-automake in Cc: (non-subscribers can't post to the Open MPI
list, so please remove that Cc: in case)

* Jeff Squyres wrote on Wed, Sep 22, 2010 at 03:50:19PM CEST:
> $AUTOMAKE_JOBS   Total wall time
>      value        of autogen.pl
>        8             3:01.46
>        4             2:55.57
>        2             3:28.09
>        1             4:38.44
> 
> This is an older Xeon machine with 2 sockets, each with 2 cores.

Thanks for the measurements!  I'm a bit surprised that the speedup is
not higher.  Do you have timings as to how much of the autogen.pl time
is spent inside automake?

IIRC the pure automake part for OpenMPI would speed up better on bigger
systems; my old numbers from two years ago are here:
http://lists.gnu.org/archive/html/automake-patches/2008-10/msg00055.html

Cheers,
Ralf


Re: [OMPI devel] Setting AUTOMAKE_JOBS

2010-09-22 Thread Jeff Squyres
On Sep 22, 2010, at 4:51 PM, Ralf Wildenhues wrote:

> Thanks for the measurements!  I'm a bit surprised that the speedup is
> not higher.  Do you have timings as to how much of the autogen.pl time
> is spent inside automake?

No, they didn't.  I re-ran them to just time autoreconf (is there a way to 
extract *just* the time spent in automake in there?).  Here's what I got:

$AUTOMAKE_JOBS   autoreconf wall time
       1             3:57.19
       2             2:43.82
       4             2:13.68
       8             2:13.47

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] RFC: make hwloc first-class data

2010-09-22 Thread Jeff Squyres
WHAT: Make hwloc a 1st class item in OMPI

WHY: At least 2 pieces of new functionality want/need to use the hwloc data

WHERE: Put it in ompi/hwloc

WHEN: Some time in the 1.5 series

TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)



A long time ago, I floated the proposal of putting hwloc at the top level in 
opal so that parts of OPAL/ORTE/OMPI could use the data directly.  I didn't 
have any concrete suggestions at the time about what exactly would use the 
hwloc data -- just a feeling that "someone" would want to.

There are now two solid examples of functionality that want to use hwloc data 
directly:

1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, 
MPI_COMM_NUMA_NODE, MPI_COMM_CORE, ...etc. (those names may not be the right 
ones, but you get the idea).  That is, pre-defined communicators that contain 
all the MPI procs on the same socket as you, the same NUMA node as you, the 
same core as you, ...etc.

2. INRIA presented a paper at Euro MPI last week that takes process distance to 
NICs into account when coming up with the long-message splitting ratio for the 
PML.  E.g., if we have 2 openib NICs with the same bandwidth, don't just assume 
that we'll split long messages 50-50 across both of them.  Instead, use NUMA 
distances to influence calculating the ratio.  See the paper here: 
http://hal.archives-ouvertes.fr/inria-00486178/en/
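
As a very rough strawman for #1 (illustration only -- this assumes Linux's 
sched_getcpu() and hwloc's 1.x HWLOC_OBJ_SOCKET object type, and is not 
proposed code), splitting MPI_COMM_WORLD by the socket a process happens to be 
running on could look like this:

/* Sketch: build a "same socket" communicator by splitting MPI_COMM_WORLD
 * on the hwloc socket that the calling process is currently running on.
 * Assumes Linux (sched_getcpu) and hwloc 1.x naming; illustration only. */
#define _GNU_SOURCE
#include <sched.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char *argv[])
{
    int rank, cpu, color = 0;
    hwloc_topology_t topo;
    MPI_Comm socket_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Find the PU we are executing on and walk up to its socket. */
    cpu = sched_getcpu();
    if (cpu >= 0) {
        hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, (unsigned) cpu);
        if (NULL != pu) {
            hwloc_obj_t socket =
                hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_SOCKET, pu);
            if (NULL != socket) {
                color = (int) socket->logical_index;
            }
        }
    }

    /* Procs with the same socket logical index land in the same
     * communicator -- a poor man's MPI_COMM_SOCKET.  (A real implementation
     * would also need to distinguish hosts and handle unbound/migrating
     * processes.) */
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &socket_comm);

    MPI_Comm_free(&socket_comm);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}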

A previous objection was that we are increasing our dependencies by making 
hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of 
business.  Fair enough.  But that being said, hwloc is getting a bit of a 
community growing around it: vendors are submitting patches for their hardware, 
distros are picking it up, etc.  I certainly can't predict the future, but 
hwloc looks in good shape for now.  There is a little risk in depending on 
hwloc, but I think it's small enough to be ok.

Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for 
embedded environments where hwloc simply takes up space and adds no value).  I 
previously proposed wrapping a subset of the hwloc API with opal_*() functions. 
 After thinking about that a bit, that seems like a lot of work for little 
benefit -- how does one decide *which* subset of hwloc should be wrapped?
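
Just to illustrate what I mean by "wrapping a subset" -- a hypothetical shim 
whose opal_*() names do not exist today:

/* Hypothetical opal_*() shim over a couple of hwloc calls; these symbols
 * are invented for illustration.  Note that even this tiny wrapper leaks
 * hwloc types to its callers, which is part of why picking the "right"
 * subset to wrap is awkward. */
#include <hwloc.h>

static hwloc_topology_t opal_hwloc_topology = NULL;

int opal_hwloc_base_init(void)
{
    if (NULL != opal_hwloc_topology) {
        return 0;                       /* already initialized */
    }
    if (0 != hwloc_topology_init(&opal_hwloc_topology)) {
        opal_hwloc_topology = NULL;
        return -1;
    }
    if (0 != hwloc_topology_load(opal_hwloc_topology)) {
        hwloc_topology_destroy(opal_hwloc_topology);
        opal_hwloc_topology = NULL;
        return -1;
    }
    return 0;
}

/* How many sockets/cores/PUs do we have?  Every new question like this
 * would need yet another wrapper entry point. */
int opal_hwloc_base_get_nbobjs(hwloc_obj_type_t type)
{
    return hwloc_get_nbobjs_by_type(opal_hwloc_topology, type);
}

void opal_hwloc_base_finalize(void)
{
    if (NULL != opal_hwloc_topology) {
        hwloc_topology_destroy(opal_hwloc_topology);
        opal_hwloc_topology = NULL;
    }
}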

Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead 
of opal/hwloc).  Indeed, the 2 places that want to use hwloc are up in the MPI 
layer -- I'm guessing that most functionality that wants hwloc will be up in 
MPI.  And if we do the build system right, we can have paffinity/hwloc and 
libmpi's hwloc all link against the same libhwloc_embedded so that:

a) there's no duplication in the process, and 
b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid 
having hwloc in OPAL/ORTE for embedded environments

(there's a little hand-waving there, but I think we can figure out the details)

We *may* want to refactor paffinity and maffinity someday, but that's not 
necessarily what I'm proposing here.

Comments?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/