On May 2, 2013, at 9:18 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
> Hi Ralph, very quick reply as I've got an SGI engineer waiting for
> me.. ;-)
>
> On 03/05/13 12:21, Ralph Castain wrote:
>
>> So the first problem is: how to know the Phis are present, how many
>> you have on each node, etc? We could push that into something like
>> the hostfile, but that requires that someone build the file. Still,
>> it would only have to be built once, so maybe that's not too bad -
>> could have a "wildcard" entry if every node is the same, etc.
>
> We're using Slurm, and it supports them already apparently, so I'm not
> sure if that helps?

It does - but to be clear: you're saying that you can directly launch
processes onto the Phis via srun? If so, then this may not be a problem,
assuming you can get confirmation that the Phis have direct access to
the interconnects.

If the answer to both is "yes", then just srun the MPI procs directly -
we support direct launch and use PMI for the wireup. Problem solved :-)

And yes - that support is indeed in the 1.6 series... just configure
--with-pmi. You may need to provide the path to where pmi.h is located
under the Slurm install, but probably not.

>> Next, we have to launch processes across the PCI bus. We had to do
>> an "rsh" launch of the MPI procs onto RR's cell processors as they
>> appeared to be separate "hosts", though only visible on the local
>> node (i.e., there was a stripped-down OS running on the cell) -
>> Paul's cmd line implies this may also be the case here. If the same
>> method works here, then we have most of that code still available
>> (needs some updating). We would probably want to look at whether or
>> not binding could be supported on the Phi's local OS.
>
> I believe that is the case - my understanding is that you can log in
> to them via SSH. We've not got that far with ours yet..
>
>> Finally, we have to wire everything up. This is where RR got a
>> little tricky, and we may encounter the same thing here. On RR, the
>> cells didn't have direct access to the interconnects - any messaging
>> had to be relayed by a process running on the main CPU. So we had to
>> create the ability to "route" MPI messages from processes running on
>> the cells to processes residing on other nodes.
>
> Gotcha.
>
>> Solving the first two is relatively straightforward. In my mind, the
>> primary issue is the last one - does anyone know if a process on the
>> Phis can "see" interconnects like a TCP NIC or an InfiniBand
>> adaptor?
>
> I'm not sure, but I can tell you that the Intel RPMs include an OFED
> install that looks like it's used on the Phi (if my reading is
> correct).
>
> cheers,
> Chris
>
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/          http://twitter.com/vlsci
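
For anyone who wants to try the direct-launch path, here's a rough
sketch of what I'm describing - the install prefix, proc count, binary
name, and mic hostname below are all placeholders, and whether the OFED
tools are even present on the card is exactly the open question:

  # Build Open MPI 1.6.x against Slurm's PMI support; point --with-pmi
  # at whatever prefix actually holds pmi.h/libpmi on your system.
  ./configure --prefix=$HOME/ompi-1.6 --with-pmi=/usr/local/slurm
  make -j8 install

  # No mpirun involved - srun launches the MPI procs directly and PMI
  # handles the wireup ("my_mpi_app" is just a stand-in binary).
  srun -n 32 ./my_mpi_app

  # To probe the interconnect question, ssh to one of the Phis and see
  # whether the OFED stack there can see the HCA ("node01-mic0" is a
  # guess at the card's hostname).
  ssh node01-mic0 ibv_devinfo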