[OMPI devel] RFC: make hwloc first-class data

2010-09-22 Thread Jeff Squyres
WHAT: Make hwloc a 1st class item in OMPI

WHY: At least 2 pieces of new functionality want/need to use the hwloc data

WHERE: Put it in ompi/hwloc

WHEN: Some time in the 1.5 series

TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)



A long time ago, I floated the proposal of putting hwloc at the top level in 
opal so that parts of OPAL/ORTE/OMPI could use the data directly.  I didn't 
have any concrete suggestions at the time about what exactly would use the 
hwloc data -- just a feeling that "someone" would want to.

There are now two solid examples of functionality that want to use hwloc data 
directly:

1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, 
MPI_COMM_NUMA_NODE, MPI_COMM_CORE, ...etc. (those names may not be the right 
ones, but you get the idea).  That is, pre-defined communicators that contain 
all the MPI procs on the same socket as you, the same NUMA node as you, the 
same core as you, ...etc.

2. INRIA presented a paper at Euro MPI last week that takes process distance to 
NICs into account when coming up with the long-message splitting ratio for the 
PML.  E.g., if we have 2 openib NICs with the same bandwidth, don't just assume 
that we'll split long messages 50-50 across both of them.  Instead, use NUMA 
distances to influence calculating the ratio.  See the paper here: 
http://hal.archives-ouvertes.fr/inria-00486178/en/
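
To make the second example concrete, here is a toy sketch of the idea -- not 
INRIA's actual heuristic and not the ob1 code, just an illustration under the 
assumption that we already know each NIC's bandwidth and its NUMA distance from 
the sending process (bw[], dist[], and share[] are hypothetical inputs/outputs):

/* Toy sketch: weight each NIC by bandwidth divided by its NUMA distance
 * from the sending process, then normalize so the shares sum to 1. */
static void split_ratio(const double bw[], const double dist[], int n,
                        double share[])
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        share[i] = bw[i] / dist[i];   /* closer NIC => larger effective weight */
        total += share[i];
    }
    for (int i = 0; i < n; i++)
        share[i] /= total;
}

With two equal-bandwidth NICs at relative distances 1.0 and 2.0, this gives 
roughly a 67/33 split instead of 50/50 -- the kind of adjustment the paper 
argues for.

For the first example (the per-socket/NUMA communicators), here is a rough 
sketch -- not the Sandia/ORNL proposal itself -- of how a "same socket" 
communicator could be approximated today with hwloc plus MPI_Comm_split.  It 
assumes each process is already bound within a single socket; for multi-node 
jobs the color would also need to encode a per-node identifier so that equal 
socket indices on different nodes don't collide:

#include <mpi.h>
#include <hwloc.h>

/* Sketch only: split MPI_COMM_WORLD by the index of the socket this
 * process is bound to. */
static MPI_Comm comm_socket(void)
{
    hwloc_topology_t topo;
    hwloc_cpuset_t bind = hwloc_cpuset_alloc();
    MPI_Comm comm;
    int rank, color = 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    hwloc_get_cpubind(topo, bind, 0);              /* where is this process bound? */

    int nsock = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    for (int i = 0; i < nsock; i++) {
        hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, i);
        if (hwloc_cpuset_isincluded(bind, sock->cpuset)) {
            color = i;                             /* index of my socket */
            break;
        }
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &comm);

    hwloc_cpuset_free(bind);
    hwloc_topology_destroy(topo);
    return comm;
}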

A previous objection was that we are increasing our dependencies by making 
hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of 
business.  Fair enough.  But that being said, hwloc is growing a bit of a 
community around it: vendors are submitting patches for their hardware, 
distros are picking it up, etc.  I certainly can't predict the future, but 
hwloc looks in good shape for now.  There is a little risk in depending on 
hwloc, but I think it's small enough to be ok.

Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for 
embedded environments where hwloc simply takes up space and adds no value).  I 
previously proposed wrapping a subset of the hwloc API with opal_*() functions. 
 After thinking about that a bit, that seems like a lot of work for little 
benefit -- how does one decide *which* subset of hwloc should be wrapped?

Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead 
of opal/hwloc).  Indeed, the 2 places that want to use hwloc are up in the MPI 
layer -- I'm guessing that most functionality that wants hwloc will be up in 
MPI.  And if we do the build system right, we can have paffinity/hwloc and 
libmpi's hwloc all link against the same libhwloc_embedded so that:

a) there's no duplication in the process, and 
b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid 
having hwloc in OPAL/ORTE for embedded environments

(there's a little hand-waving there, but I think we can figure out the details)

We *may* want to refactor paffinity and maffinity someday, but that's not 
necessarily what I'm proposing here.

Comments?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-devel] Create success (hwloc r1.1a1r2491)

2010-09-22 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot:   hwloc 1.1a1r2491
Start time: Wed Sep 22 21:01:04 EDT 2010
End time:   Wed Sep 22 21:03:14 EDT 2010

Your friendly daemon,
Cyrador


Re: [OMPI devel] Setting AUTOMAKE_JOBS

2010-09-22 Thread Jeff Squyres
On Sep 22, 2010, at 4:51 PM, Ralf Wildenhues wrote:

> Thanks for the measurements!  I'm a bit surprised that the speedup is
> not higher.  Do you have timings as to how much of the autogen.pl time
> is spent inside automake?

No, they didn't.  I re-ran them to just time autoreconf (is there a way to 
extract *just* the time spent in automake in there?).  Here's what I got:

$AUTOMAKE_JOBS   Total wall time
value            of autoreconf
1                3:57.19
2                2:43.82
4                2:13.68
8                2:13.47

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Setting AUTOMAKE_JOBS

2010-09-22 Thread Ralf Wildenhues
Hi Jeff,

adding bug-automake in Cc: (non-subscribers can't post to the Open MPI
list, so please remove that Cc: in case)

* Jeff Squyres wrote on Wed, Sep 22, 2010 at 03:50:19PM CEST:
> $AUTOMAKE_JOBS   Total wall time
> value            of autogen.pl
> 8                3:01.46
> 4                2:55.57
> 2                3:28.09
> 1                4:38.44
> 
> This is an older Xeon machine with 2 sockets, each with 2 cores.

Thanks for the measurements!  I'm a bit surprised that the speedup is
not higher.  Do you have timings as to how much of the autogen.pl time
is spent inside automake?

IIRC the pure automake part for OpenMPI would speed up better on bigger
systems; my old numbers from two years ago are here:
http://lists.gnu.org/archive/html/automake-patches/2008-10/msg00055.html

Cheers,
Ralf


Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Kenneth Lloyd
Thank you very much.

Ken

-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Wednesday, September 22, 2010 10:09 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml

I see it here:

http://hal.archives-ouvertes.fr/inria-00486178/en/



On Sep 22, 2010, at 11:53 AM, Kenneth Lloyd wrote:

> Jeff,
> 
> Is that EuroMPI2010 ob1 paper publicly available? I get involved in various 
> NUMA partitioning/architecting studies and it seems there is not a lot of 
> discussion in this area.
> 
> Ken Lloyd
> 
> ==
> Kenneth A. Lloyd
> Watt Systems Technologies Inc.
> 
> 
> 
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of Jeff Squyres
> Sent: Wednesday, September 22, 2010 6:00 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml
> 
> Sorry for the delay in replying -- I was in Europe for the past two weeks; 
> travel always makes me way behind on my INBOX...
> 
> 
> On Sep 14, 2010, at 9:56 PM, 张晶 wrote:
> 
>> I tried to add a schedule algorithm to the pml component (ob1, etc.).  Unfortunately, I 
>> could only find a paper named "Open MPI: A Flexible High Performance MPI" 
>> and some annotations in the source files.  From them, I know ob1 has 
>> implemented round-robin and weighted distribution algorithms.  But after 
>> tracking MPI_Send(), I can't figure out 
>> where these are implemented, let alone how to add a new schedule algorithm. 
>> I have two questions:
>> 1. Where is the schedule algorithm located?
> 
> It's complicated -- I'd say that the PML is probably among the most 
> complicated sections of Open MPI because it is the main "engine" that 
> enforces the MPI point-to-point semantics.  The algorithm is fairly well 
> distributed throughout the PML source code.  :-\
> 
>> 2. There are five components in the pml framework: cm, crcpw, csum, ob1, and v. 
>> What is the function of each of these components?
> 
> cm: this component drives the MTL point-to-point components.  It is mainly a 
> thin wrapper for network transports that provide their own MPI-like matching 
> semantics.  Hence, most of the MPI semantics are effectively done in the 
> lower layer (i.e., in the MTL components and their dependent libraries).  You 
> probably won't be able to do much here, because such transports (MX, Portals, 
> etc.) do most of their semantics in the network layer -- not in Open MPI.  If 
> you have a matching network layer, this is the PML that you probably use (MX, 
> Portals, PSM).
> 
> crcpw: this is a fork of the ob1 PML; it adds some failover semantics.
> 
> csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so 
> you can tell if the underlying transport had an error).
> 
> v: this PML uses logging and replay to effect some level of fault tolerance.  
> It's a distant fork of the ob1 PML, but has quite a few significant 
> differences.
> 
> ob1: this is the "main" PML that most users use (TCP, shared memory, 
> OpenFabrics, etc.).  It gangs together one or more BTLs to send/receive 
> messages across individual network transports.  Hence, it supports true 
> multi-device/multi-rail algorithms.  The BML (BTL multiplexing layer) is a 
> thin management layer that marshals all the BTLs in the process together -- 
> it's mainly array handling, etc.  The ob1 PML is the one that decides 
> multi-rail/device splitting, etc.  The INRIA folks just published a paper 
> last week at Euro MPI about adjusting the ob1 scheduling algorithm to also 
> take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth 
> calculations.
> 
> Hope this helps!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Jeff Squyres
I see it here:

http://hal.archives-ouvertes.fr/inria-00486178/en/



On Sep 22, 2010, at 11:53 AM, Kenneth Lloyd wrote:

> Jeff,
> 
> Is that EuroMPI2010 ob1 paper publicly available? I get involved in various 
> NUMA partitioning/architecting studies and it seems there is not a lot of 
> discussion in this area.
> 
> Ken Lloyd
> 
> ==
> Kenneth A. Lloyd
> Watt Systems Technologies Inc.
> 
> 
> 
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of Jeff Squyres
> Sent: Wednesday, September 22, 2010 6:00 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml
> 
> Sorry for the delay in replying -- I was in Europe for the past two weeks; 
> travel always makes me way behind on my INBOX...
> 
> 
> On Sep 14, 2010, at 9:56 PM, 张晶 wrote:
> 
>> I tried to add a schedule algorithm to the pml component (ob1, etc.).  Unfortunately, I 
>> could only find a paper named "Open MPI: A Flexible High Performance MPI" 
>> and some annotations in the source files.  From them, I know ob1 has 
>> implemented round-robin and weighted distribution algorithms.  But after 
>> tracking MPI_Send(), I can't figure out 
>> where these are implemented, let alone how to add a new schedule algorithm. 
>> I have two questions:
>> 1. Where is the schedule algorithm located?
> 
> It's complicated -- I'd say that the PML is probably among the most 
> complicated sections of Open MPI because it is the main "engine" that 
> enforces the MPI point-to-point semantics.  The algorithm is fairly well 
> distributed throughout the PML source code.  :-\
> 
>> 2. There are five components in the pml framework: cm, crcpw, csum, ob1, and v. 
>> What is the function of each of these components?
> 
> cm: this component drives the MTL point-to-point components.  It is mainly a 
> thin wrapper for network transports that provide their own MPI-like matching 
> semantics.  Hence, most of the MPI semantics are effectively done in the 
> lower layer (i.e., in the MTL components and their dependent libraries).  You 
> probably won't be able to do much here, because such transports (MX, Portals, 
> etc.) do most of their semantics in the network layer -- not in Open MPI.  If 
> you have a matching network layer, this is the PML that you probably use (MX, 
> Portals, PSM).
> 
> crcpw: this is a fork of the ob1 PML; it adds some failover semantics.
> 
> csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so 
> you can tell if the underlying transport had an error).
> 
> v: this PML uses logging and replay to effect some level of fault tolerance.  
> It's a distant fork of the ob1 PML, but has quite a few significant 
> differences.
> 
> ob1: this is the "main" PML that most users use (TCP, shared memory, 
> OpenFabrics, etc.).  It gangs together one or more BTLs to send/receive 
> messages across individual network transports.  Hence, it supports true 
> multi-device/multi-rail algorithms.  The BML (BTL multiplexing layer) is a 
> thin management layer that marshals all the BTLs in the process together -- 
> it's mainly array handling, etc.  The ob1 PML is the one that decides 
> multi-rail/device splitting, etc.  The INRIA folks just published a paper 
> last week at Euro MPI about adjusting the ob1 scheduling algorithm to also 
> take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth 
> calculations.
> 
> Hope this helps!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-devel] roadmap

2010-09-22 Thread Samuel Thibault
Jeff Squyres wrote on Wed 22 Sep 2010 13:37:12 +0200:
> I think we should support memory binding, even if it does weird things -- 
> i.e., dropping membinding support on a given OS shouldn't be an option.

That's why I'd tend to keep set_cpubind and set_membind, warning that
one may have an impact on the other, providing a flag for those who really
care, and a binding guideline for normal users.

> And/or have an "atomic"-like function that sets the memory binding and 
> returns the process memory binding? 

I'm not sure I understand what this means.

> It would be good to put a sunset date or version on when hwloc_cpuset_foo 
> will expire (e.g., 6 months from now or two major revisions from now [1.3] -- 
> whichever comes last...?).

Ok.

> I'd also prefer a typedef rather than a #define for types.

Sure.

Samuel


Re: [hwloc-devel] roadmap

2010-09-22 Thread Brice Goglin
On 22/09/2010 16:30, Jeff Squyres wrote:
> On Sep 22, 2010, at 8:09 AM, Brice Goglin wrote:
>
>   
>> hwloc_set_*? hwloc_objset* ? Anything better?
>>
>> hwloc_set_* might not be the best since we would have a hwloc_set_set()
>> function to set one bit :)
>> 
> Agreed.  Too bad, though -- I liked hwloc_set*.
>
> hwloc_group* (that seems kinda lame, though)
> hwloc_stuff* (hah)
> hwloc_bitmap*
>
> ?
>   

bitmap or bitmask would be acceptable to me.

>> By the way, hwloc_cpuset_cpu() and hwloc_cpuset_all_but_cpu() should be
>> renamed too. hwloc_set_onlyone() and hwloc_set_allbutone() maybe?
>> 
> How about just hwloc_set() which takes a single position parameter?  
> "onlyone" can be implied.
>   

In case you missed it: cpu() = zero() + set(), and all_but_cpu() = fill() + clr().
Maybe just drop these?
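
Just to spell out those equivalences with the current hwloc_cpuset_* names (a 
throwaway sketch, nothing more):

#include <assert.h>
#include <hwloc.h>

int main(void)
{
    hwloc_cpuset_t a = hwloc_cpuset_alloc();
    hwloc_cpuset_t b = hwloc_cpuset_alloc();

    hwloc_cpuset_cpu(a, 3);              /* only CPU 3 set          */
    hwloc_cpuset_zero(b);
    hwloc_cpuset_set(b, 3);              /* same thing in two calls */
    assert(hwloc_cpuset_isequal(a, b));

    hwloc_cpuset_all_but_cpu(a, 3);      /* everything except CPU 3 */
    hwloc_cpuset_fill(b);
    hwloc_cpuset_clr(b, 3);              /* same thing in two calls */
    assert(hwloc_cpuset_isequal(a, b));

    hwloc_cpuset_free(a);
    hwloc_cpuset_free(b);
    return 0;
}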

Brice



Re: [hwloc-devel] roadmap

2010-09-22 Thread Jeff Squyres
On Sep 22, 2010, at 8:09 AM, Brice Goglin wrote:

> hwloc_set_*? hwloc_objset* ? Anything better?
> 
> hwloc_set_* might not be the best since we would have a hwloc_set_set()
> function to set one bit :)

Agreed.  Too bad, though -- I liked hwloc_set*.

hwloc_group* (that seems kinda lame, though)
hwloc_stuff* (hah)
hwloc_bitmap*

?

> By the way, hwloc_cpuset_cpu() and hwloc_cpuset_all_but_cpu() should be
> renamed too. hwloc_set_onlyone() and hwloc_set_allbutone() maybe?

How about just hwloc_set() which takes a single position parameter?  "onlyone" 
can be implied.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-devel] roadmap

2010-09-22 Thread Samuel Thibault
Brice Goglin wrote on Wed 22 Sep 2010 10:38:38 +0200:
> * Some OS bind the process too when you bind memory.

Not for all kinds of memory bindings. For now, nothing that has been
committed does that; it's only in the remaining TODOs. The bindings in
question are policy bindings, i.e. not binding some given area or
explicitly allocating some given size.

>   + Add a flag such as HWLOC_MEMBIND_EVEN_IF_FAR_FROM_PROCESS

The length of the word tells me that won't be convenient :)

> so that the user can explicitly refuse memory binding if it may break
> process binding

>   + Drop hwloc_set_membind on these OSes and add a
> hwloc_set_cpumembind() to bind both

That's the solution I prefer most as it directly maps to existing OS
practice.

>   + Make both process and memory binding do nothing if the STRICT flag
> is given. But I'd rather not play too much with this flag.

Yes. We should not put too vague a semantic on this.

>   + Drop support for memory binding on these OS.

Not all support, just setting the policy.

>   + Drop these OS.

Nope :)

> * cpuset and nodeset structures are the same, they are both manipulated
> with hwloc_cpuset_foo functions. So maybe rename into hwloc_set_t and
> hwloc_set_foo functions. With #define and aliases to not break API/ABIs.

I'd say so.

Samuel


Re: [OMPI devel] Question regarding recently common shared-memory component

2010-09-22 Thread Jeff Squyres
On Sep 21, 2010, at 12:37 PM,  wrote:

> Like I said in my earlier response, I have never tried this option. So I ran 
> these tests on 1.4.2 now and apparently the behavior is the same, i.e., the 
> checkpoint creation time increases when I enable the shared memory component.

I don't have huge experience with the checkpoint/restart stuff, but this is 
probably not a surprising result because the checkpoint will now need to 
include the shared memory stuff. Are the checkpoint images larger?  (at least: 
is one of them noticeably larger?)  That might account for the checkpoint 
performance difference.

> Is there any parameter that can be tuned to improve the performance?

My understanding is that there are some inherent bottlenecks in checkpoint / 
restart, such as the time required to dump out all the process images to disk.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-22 Thread Pascal Deveze
I just committed the very last modifications of ROMIO (mpich2-1.3rc1) 
into bitbucket.

Pascal

Jeff Squyres wrote:
> On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:
> 
>> In charge of ticket 1888 (see at 
>> https://svn.open-mpi.org/trac/ompi/ticket/1888) ,
>> I have put the resulting code in bitbucket at:
>> http://bitbucket.org/devezep/new-romio-for-openmpi/
> 
> Sweet!
> 
>> The work in this repo consisted in refreshing ROMIO to a newer
>> version: the one from the very last MPICH2 release (mpich2-1.3b1).
> 
> Great!  I saw there was another MPICH2 release, and I saw a ROMIO patch or 
> three go by on the MPICH list recently.  Do you expect there to be major 
> differences between what you have and those changes?
> 
> I don't have any parallel filesystems to test with, but if someone else in the 
> community could confirm/verify at least one or two of the parallel filesystems 
> supported in ROMIO, I think we should bring this stuff into the trunk soon.
> 
>> Testing:
>> 1. runs fine except one minor error (see the explanation below) on various FS.
>> 2. runs fine with Lustre, but:
>>    . had to add a small patch in romio/adio/ad_lustre_open.c
> 
> Did this patch get pushed upstream?
> 
>> === The minor error ===
>> The test error.c fails because OpenMPI does not handle correctly the
>> "two level" error functions of ROMIO:
>>   error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
>>                                     myname, __LINE__, MPI_ERR_ARG,
>>                                     "**iobaddisp", 0);
>> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".
> 
> Do you mean that we should be returning an error string "**iobaddisp" instead of 
> "MPI_ERR_ARG"?




[OMPI devel] Setting AUTOMAKE_JOBS

2010-09-22 Thread Jeff Squyres
Some of you may be unaware that recent versions of automake can run in 
parallel.  That is, automake will run in parallel with a degree of (at most) 
$AUTOMAKE_JOBS.  This can speed up the execution time of autogen.pl quite a bit 
on some platforms.  On my cluster at cisco, here's a few quick timings of the 
entire autogen.pl process (of which, automake is the bottleneck):

$AUTOMAKE_JOBS   Total wall time
value            of autogen.pl
8                3:01.46
4                2:55.57
2                3:28.09
1                4:38.44

This is an older Xeon machine with 2 sockets, each with 2 cores.

There's a nice performance jump from 1 to 2, and a smaller jump from 2 to 4.  4 
and 8 are close enough to not matter.  YMMV.

I just committed a heuristic to autogen.pl to setenv AUTOMAKE_JOBS if it is not 
already set (https://svn.open-mpi.org/trac/ompi/changeset/23788):

- If lstopo is found in your $PATH, autogen.pl runs it and counts how many PUs 
(processing units) you have.  It'll set AUTOMAKE_JOBS to that number, up to a 
maximum of 4 (which is admittedly a further heuristic).
- If lstopo is not found, it just sets AUTOMAKE_JOBS to 2.
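
For reference, the same decision expressed in C against the hwloc API (the real 
heuristic lives in autogen.pl and parses lstopo output; this is only a sketch of 
the equivalent logic):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    int jobs = 2;                          /* fallback when the topology is unknown */
    hwloc_topology_t topo;

    if (hwloc_topology_init(&topo) == 0) {
        if (hwloc_topology_load(topo) == 0) {
            int npu = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
            if (npu > 0)
                jobs = npu > 4 ? 4 : npu;  /* cap at 4, as in the changeset */
        }
        hwloc_topology_destroy(topo);
    }
    printf("AUTOMAKE_JOBS=%d\n", jobs);
    return 0;
}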

Enjoy.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Barrier() after Finalize() when a file handle is leaked.

2010-09-22 Thread Jeff Squyres
Thanks Lisandro!

I filed https://svn.open-mpi.org/trac/ompi/ticket/2594 about this.


On Sep 15, 2010, at 11:28 AM, Lisandro Dalcin wrote:

> I've tested this with (--enable-debug --enable-picky
> --enable-mem-debug) 1.4.2 and 1.5rc6. Despite being debug builds, a
> mpi4py user got the same with (likely release) builds in both Ubuntu
> and OS X.
> 
> $ cat open.c
> #include <mpi.h>
> int main(int argc, char *argv[]) {
>  MPI_File f;
>  MPI_Init(&argc, &argv);
>  MPI_File_open(MPI_COMM_WORLD, "test.plt", MPI_MODE_RDONLY, MPI_INFO_NULL, 
> &f);
>  /* MPI_File_close(&f); */
>  MPI_Finalize();
>  return 0;
> }
> 
> $ mpicc open.c
> 
> $ ./a.out
> *** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [trantor:15145] Abort after MPI_FINALIZE completed successfully; not
> able to guarantee that all other processes were killed!
> 
> 
> So if you open a file but never close it, a MPI_Barrier() gets called
> after MPI_Finalize(). Could that come from a finalizer ROMIO callback?
> However, I do not get this failure with MPICH2, and Open MPI seems to
> behave just fine regarding MPI_Finalized(); the code below works as
> expected:
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> static int atexitmpi(MPI_Comm comm, int k, void *v, void *xs) {
>  int flag;
>  MPI_Finalized(&flag);
>  printf("atexitmpi: finalized=%d\n", flag);
>  MPI_Barrier(MPI_COMM_WORLD);
>  return MPI_SUCCESS;
> }
> 
> int main(int argc, char *argv[]) {
>  int keyval = MPI_KEYVAL_INVALID;
>  MPI_Init(&argc, &argv);
>  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, atexitmpi, &keyval, 0);
>  MPI_Comm_set_attr(MPI_COMM_SELF, keyval, 0);
>  MPI_Finalize();
>  return 0;
> }
> 
> 
> 
> -- 
> Lisandro Dalcin
> ---
> CIMEC (INTEC/CONICET-UNL)
> Predio CONICET-Santa Fe
> Colectora RN 168 Km 472, Paraje El Pozo
> Tel: +54-342-4511594 (ext 1011)
> Tel/Fax: +54-342-4511169
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-22 Thread Pascal Deveze

Jeff Squyres wrote:
> On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:
> 
>> In charge of ticket 1888 (see at 
>> https://svn.open-mpi.org/trac/ompi/ticket/1888) ,
>> I have put the resulting code in bitbucket at:
>> http://bitbucket.org/devezep/new-romio-for-openmpi/
> 
> Sweet!
> 
>> The work in this repo consisted in refreshing ROMIO to a newer
>> version: the one from the very last MPICH2 release (mpich2-1.3b1).
> 
> Great!  I saw there was another MPICH2 release, and I saw a ROMIO patch or 
> three go by on the MPICH list recently.  Do you expect there to be major 
> differences between what you have and those changes?

I also see this new release (mpich2-1.3rc1). I am going to port the 
modifications and inform the list.

> I don't have any parallel filesystems to test with, but if someone else in the 
> community could confirm/verify at least one or two of the parallel filesystems 
> supported in ROMIO, I think we should bring this stuff into the trunk soon.
> 
>> Testing:
>> 1. runs fine except one minor error (see the explanation below) on various FS.
>> 2. runs fine with Lustre, but:
>>    . had to add a small patch in romio/adio/ad_lustre_open.c
> 
> Did this patch get pushed upstream?

This patch was integrated yesterday in mpich2-1.3rc1, along with another 
patch in romio/adio/common/lock.c. They will be available very soon in 
bitbucket.

>> === The minor error ===
>> The test error.c fails because OpenMPI does not handle correctly the
>> "two level" error functions of ROMIO:
>>   error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
>>                                     myname, __LINE__, MPI_ERR_ARG,
>>                                     "**iobaddisp", 0);
>> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".
> 
> Do you mean that we should be returning an error string "**iobaddisp" instead of 
> "MPI_ERR_ARG"?

In MPICH2, they have a file mpi/errhan/errnames.txt that generates 
mpi/errhan/errnames.h, mapping codes like "**iobaddisp" to the 
corresponding error string "Invalid displacement argument".
The error.c program tests for the presence of "displacement" in the 
error string.

With OpenMPI, the error message is:
"MPI_ERR_ARG: invalid argument of some other kind"

With MPICH2, the error message is:
"Invalid argument, error stack:
MPI_FILE_SET_VIEW(60): Invalid displacement argument"

It would be better if OpenMPI displayed at least the "Invalid 
displacement argument" message.

This is not a new problem in OpenMPI; it was also the case in the trunk.



Re: [hwloc-devel] roadmap

2010-09-22 Thread Brice Goglin
On 22/09/2010 13:36, Jeff Squyres wrote:
> On Sep 22, 2010, at 4:38 AM, Brice Goglin wrote:
>
>   
>> There are still some problems to solve in the membind branch:
>> * Some OS bind the process too when you bind memory. I see the following
>> solutions:
>>  + Add a flag such as HWLOC_MEMBIND_EVEN_IF_FAR_FROM_PROCESS so that
>> the user can explicitly refuse memory binding if it may break process
>> binding
>>  + Drop hwloc_set_membind on these OSes and add a
>> hwloc_set_cpumembind() to bind both
>>  + Make both process and memory binding do nothing if the STRICT flag
>> is given. But I'd rather not play too much with this flag.
>>  + Drop support for memory binding on these OS.
>>  + Drop these OS.
>> 
> What OS's are you specifically referring to?
>   

IIRC, it was AIX and Solaris.

>  How about adding a query function that says what will happen for 
> hwloc_set_membind()

I like it; we can put this in the output of hwloc_topology_get_support.

I wonder if there are some other cases where the STRICT flag could be
dropped in favor of such an informational query.
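
A sketch of what that could look like from the caller's side.  The ->cpubind 
flags below already exist in hwloc_topology_get_support(); the membind part is 
exactly what is being proposed here, so it is only shown as a hypothetical 
comment:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    const struct hwloc_topology_support *sup = hwloc_topology_get_support(topo);
    printf("can bind this process: %d\n", (int) sup->cpubind->set_thisproc_cpubind);

    /* Hypothetical membind support flag, as discussed in this thread:
     * printf("membind may also rebind the process: %d\n",
     *        (int) sup->membind->membind_rebinds_process);
     */

    hwloc_topology_destroy(topo);
    return 0;
}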


> Just curious -- on these OS's, what happens if you:
>
> - bind proc to A
> - bind memory to B (which then also re-binds proc to B)
> - re-bind proc to A
>
> Is the memory binding then lost?
>   

I'll let Samuel comment on this.

>> * cpuset and nodeset structures are the same, they are both manipulated
>> with hwloc_cpuset_foo functions. So maybe rename into hwloc_set_t and
>> hwloc_set_foo functions. With #define and aliases to not break API/ABIs.
>> 
> I'm in favor of this -- it would end the overloading of the term "cpuset" 
> between hwloc and cpuset.
>   

hwloc_set_*? hwloc_objset* ? Anything better?

hwloc_set_* might not be the best since we would have a hwloc_set_set()
function to set one bit :)

By the way, hwloc_cpuset_cpu() and hwloc_cpuset_all_but_cpu() should be
renamed too. hwloc_set_onlyone() and hwloc_set_allbutone() maybe?

Brice



Re: [OMPI devel] How to add a schedule algorithm to the pml

2010-09-22 Thread Jeff Squyres
Sorry for the delay in replying -- I was in Europe for the past two weeks; 
travel always makes me way behind on my INBOX...


On Sep 14, 2010, at 9:56 PM, 张晶 wrote:

> I tried to add a schedule algorithm to the pml component (ob1, etc.).  Unfortunately, I 
> could only find a paper named "Open MPI: A Flexible High Performance MPI" and 
> some annotations in the source files.  From them, I know ob1 has implemented 
> round-robin and weighted distribution algorithms.  But after tracking 
> MPI_Send(), I can't figure out 
> where these are implemented, let alone how to add a new schedule algorithm. 
> I have two questions:
> 1. Where is the schedule algorithm located?

It's complicated -- I'd say that the PML is probably among the most complicated 
sections of Open MPI because it is the main "engine" that enforces the MPI 
point-to-point semantics.  The algorithm is fairly well distributed throughout 
the PML source code.  :-\

> 2. There are five components in the pml framework: cm, crcpw, csum, ob1, and v. 
> What is the function of each of these components?

cm: this component drives the MTL point-to-point components.  It is mainly a 
thin wrapper for network transports that provide their own MPI-like matching 
semantics.  Hence, most of the MPI semantics are effectively done in the lower 
layer (i.e., in the MTL components and their dependent libraries).  You 
probably won't be able to do much here, because such transports (MX, Portals, 
etc.) do most of their semantics in the network layer -- not in Open MPI.  If 
you have a matching network layer, this is the PML that you probably use (MX, 
Portals, PSM).

crcpw: this is a fork of the ob1 PML; it adds some failover semantics.

csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so 
you can tell if the underlying transport had an error).

v: this PML uses logging and replay to effect some level of fault tolerance.  
It's a distant fork of the ob1 PML, but has quite a few significant differences.

ob1: this is the "main" PML that most users use (TCP, shared memory, 
OpenFabrics, etc.).  It gangs together one or more BTLs to send/receive 
messages across individual network transports.  Hence, it supports true 
multi-device/multi-rail algorithms.  The BML (BTL multiplexing layer) is a thin 
management layer that marshals all the BTLs in the process together -- it's 
mainly array handling, etc.  The ob1 PML is the one that decides 
multi-rail/device splitting, etc.  The INRIA folks just published a paper last 
week at Euro MPI about adjusting the ob1 scheduling algorithm to also take 
NUMA/NUNA/NUIOA effects into account, not just raw bandwidth calculations.

Hope this helps!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-22 Thread Jeff Squyres
On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:

> In charge of ticket 1888 (see at 
> https://svn.open-mpi.org/trac/ompi/ticket/1888) ,
> I have put the resulting code in bitbucket at:
> http://bitbucket.org/devezep/new-romio-for-openmpi/

Sweet!

> The work in this repo consisted in refreshing ROMIO to a newer
> version: the one from the very last MPICH2 release (mpich2-1.3b1).

Great!  I saw there was another MPICH2 release, and I saw a ROMIO patch or 
three go by on the MPICH list recently.  Do you expect there to be major 
differences between what you have and those changes?

I don't have any parallel filesystems to test with, but if someone else in the 
community could confirm/verify at least one or two of the parallel filesystems 
supported in ROMIO, I think we should bring this stuff into the trunk soon.

> Testing:
> 1. runs fine except one minor error (see the explanation below) on various FS.
> 2. runs fine with Lustre, but:
>    . had to add a small patch in romio/adio/ad_lustre_open.c

Did this patch get pushed upstream?

> === The minor error ===
> The test error.c fails because OpenMPI does not handle correctly the
> "two level" error functions of ROMIO:
>   error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
>                                     myname, __LINE__, MPI_ERR_ARG,
>                                     "**iobaddisp", 0);
> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".

Do you mean that we should be returning an error string "**iobaddisp" instead 
of "MPI_ERR_ARG"?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-devel] roadmap

2010-09-22 Thread Brice Goglin
Hello,

hwloc 1.0 was released in May. I think we should release 1.1 before
SC10, which means doing a first RC within a couple weeks.

The trunk has gotten many changes since 1.0, but nothing very important. Trac
says we're missing memory binding, distances, and user-defined process
restrictions. Memory binding is the most important one; it was supposed
to be in 1.0. I think we shouldn't defer 1.1 because of the others.

There are still some problems to solve in the membind branch:
* Some OS bind the process too when you bind memory. I see the following
solutions:
  + Add a flag such as HWLOC_MEMBIND_EVEN_IF_FAR_FROM_PROCESS so that
the user can explicitly refuse memory binding if it may break process
binding
  + Drop hwloc_set_membind on these OSes and add a
hwloc_set_cpumembind() to bind both
  + Make both process and memory binding do nothing if the STRICT flag
is given. But I'd rather not play too much with this flag.
  + Drop support for memory binding on these OS.
  + Drop these OS.
* cpuset and nodeset structures are the same, they are both manipulated
with hwloc_cpuset_foo functions. So maybe rename into hwloc_set_t and
hwloc_set_foo functions. With #define and aliases to not break API/ABIs.

Opinions?
Brice



Re: [hwloc-devel] hwloc powerpc rhel5 and power7 patch

2010-09-22 Thread Alexey Kardashevskiy

On 21/09/10 19:34, Samuel Thibault wrote:
> Just a last question: is it ok to include the /proc and /sys trees you
> have posted in the hwloc testcases?

That's ok.