Jeff, is that EuroMPI 2010 ob1 paper publicly available? I get involved in various NUMA partitioning/architecting studies, and it seems there is not a lot of discussion in this area.
Ken Lloyd
==================
Kenneth A. Lloyd
Watt Systems Technologies Inc.

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Wednesday, September 22, 2010 6:00 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml

Sorry for the delay in replying -- I was in Europe for the past two weeks; travel always makes me waaaay behind on my INBOX...

On Sep 14, 2010, at 9:56 PM, 张晶 wrote:

> I tried to add a scheduling algorithm to a PML component (ob1, etc.). Unfortunately, I could only find a paper named "Open MPI: A Flexible High Performance MPI" and some comments in the source files. From them, I know that ob1 implements round-robin and weighted distribution algorithms. But after tracing MPI_Send(), I can't find where these are implemented, let alone add a new scheduling algorithm. I have two questions:
>
> 1. Where is the scheduling algorithm located?

It's complicated -- I'd say that the PML is probably among the most complicated sections of Open MPI because it is the main "engine" that enforces the MPI point-to-point semantics. The algorithm is fairly well distributed throughout the PML source code. :-\

> 2. There are five components in the pml framework: cm, crcpw, csum, ob1, and v. What do these components do?

cm: this component drives the MTL point-to-point components. It is mainly a thin wrapper for network transports that provide their own MPI-like matching semantics. Hence, most of the MPI semantics are effectively done in the lower layer (i.e., in the MTL components and their dependent libraries). You probably won't be able to do much here, because such transports (MX, Portals, etc.) do most of their semantics in the network layer -- not in Open MPI. If you have a matching network layer (MX, Portals, PSM), this is the PML that you probably use.

crcpw: this is a fork of the ob1 PML; it adds some failover semantics.

csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so you can tell if the underlying transport had an error).

v: this PML uses logging and replay to effect some level of fault tolerance. It's a distant fork of the ob1 PML, but has quite a few significant differences.

ob1: this is the "main" PML that most users use (TCP, shared memory, OpenFabrics, etc.). It gangs together one or more BTLs to send/receive messages across individual network transports, so it supports true multi-device/multi-rail algorithms. The BML (BTL multiplexing layer) is a thin management layer that marshals all the BTLs in the process together -- it's mainly array handling, etc. The ob1 PML is the one that decides the multi-rail/multi-device splitting, etc. (a rough sketch of that weighted splitting follows below). The INRIA folks just published a paper last week at EuroMPI about adjusting the ob1 scheduling algorithm to also take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth calculations.

Hope this helps!

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
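[Editor's note] For readers trying to picture the weighted distribution that ob1 performs across its BTLs, here is a minimal, self-contained sketch of the general idea: a large message is split among the available rails in proportion to their relative bandwidth. This is NOT the actual ob1 source; all names here (btl_info_t, schedule_send) are hypothetical, and the real PML scheduling also has to account for per-BTL eager/maximum send sizes, pipelining, and completion callbacks.

/*
 * Hedged sketch only -- illustrates weighted multi-rail splitting,
 * not Open MPI's actual ob1 implementation.
 */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;       /* hypothetical rail name, e.g. "ib0", "tcp0" */
    double      bandwidth;  /* relative bandwidth weight                  */
} btl_info_t;

/* Split 'total' bytes across the BTLs in proportion to their weights. */
static void schedule_send(const btl_info_t *btls, int nbtls, size_t total)
{
    double weight_sum = 0.0;
    for (int i = 0; i < nbtls; ++i) {
        weight_sum += btls[i].bandwidth;
    }

    size_t assigned = 0;
    for (int i = 0; i < nbtls; ++i) {
        size_t chunk;
        if (i == nbtls - 1) {
            /* Last BTL takes the remainder so every byte is scheduled. */
            chunk = total - assigned;
        } else {
            chunk = (size_t)(total * (btls[i].bandwidth / weight_sum));
        }
        assigned += chunk;
        printf("%-6s gets %zu bytes\n", btls[i].name, chunk);
    }
}

int main(void)
{
    /* Two rails with a 10:1 bandwidth ratio (assumed example values). */
    btl_info_t btls[] = {
        { "ib0",  10.0 },
        { "tcp0",  1.0 },
    };
    schedule_send(btls, 2, (size_t)1 << 20);  /* schedule a 1 MiB message */
    return 0;
}

In this toy example the fast rail receives roughly ten times as many bytes as the slow one. The EuroMPI 2010 work mentioned above adjusts this kind of bandwidth-only weighting to also account for NUMA/NUNA/NUIOA placement of the sending process relative to each NIC.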