On Nov 21, 2011, at 5:04 PM, <milind.bhandar...@emc.com> wrote:

> Ralph,
>
> Yes, I have completed the first step, although I would really like that
> code to be part of the MPI Application Master (Chris Douglas suggested a
> way to do this at ApacheCon).
>
> Regarding the remaining steps, I have been following discussions on the
> open mpi mailing lists, and reading code for hwloc.
>
> If you are making a trip to Cisco HQ sometime soon, I would like to have
> a face-to-face about hwloc.
Not sure that looks likely right now - my project at Cisco is done, and it
appears I'll be leaving the company soon.

> I have so far avoided using a native task controller for spawning MPI
> jobs, but given the lack of support for binding in Java, it looks like I
> will have to bite the bullet.

I was actually looking at porting the binding support to Java, as it looks
feasible to do so, and I can understand not wanting to absorb all that
configuration code to handle it in C.

Given the loss of my job, I have some free time on my hands while I search
for employment, so I thought I might spend it looking at the Hadoop
integration - since you have completed the wireup, I might look at this
next.

>
> - milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author,
> and do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 11/21/11 3:54 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>
>> Hi Milind
>>
>> Glad to hear of the progress - I recall our earlier conversation. I
>> gather you have completed step 1 (wireup) - have you given any thought
>> to the other two steps? Anything I can do to help?
>>
>> Ralph
>>
>>
>> On Nov 21, 2011, at 4:47 PM, <milind.bhandar...@emc.com> wrote:
>>
>>> Hi Ralph,
>>>
>>> I spoke with Jeff Squyres at SC11, and updated him on the status of my
>>> OpenMPI port on Hadoop Yarn.
>>>
>>> To update everyone, I have OpenMPI examples running on #Yarn, although
>>> it requires some code cleanup and refactoring; that can be done as a
>>> later step.
>>>
>>> Currently, the MPI processes come up, get the submitting client's IP
>>> and port via environment variables, connect to it, and do a barrier.
>>> The result of this barrier is that everyone in MPI_COMM_WORLD gets
>>> each other's endpoints.
>>>
>>> I am aiming to submit the patch to hadoop by the end of this month.
>>>
>>> I will publish the openmpi patch to github.
>>>
>>> (As I mentioned to Jeff, OpenMPI requires a CCLA for accepting
>>> submissions. That will take some time.)
>>>
>>> - Milind
>>>
>>> ---
>>> Milind Bhandarkar
>>> Greenplum Labs, EMC
>>> (Disclaimer: Opinions expressed in this email are those of the author,
>>> and do not necessarily represent the views of any organization, past
>>> or present, the author might be affiliated with.)
>>>
>>>
>>>
>>>> I'm willing to do the integration work, but wanted to check first to
>>>> see if (a) someone in the Hadoop community is already doing so, and
>>>> (b) if you would be interested in seeing such a capability and
>>>> willing to accept the code contribution?
>>>>
>>>> Establishing MPI support requires the following steps:
>>>>
>>>> 1. wireup support. MPI processes need to exchange endpoint info
>>>> (e.g., for TCP connections, IP address and port) so that each process
>>>> knows how to connect to any other process in the application. This is
>>>> typically done in a collective "modex" operation. There are several
>>>> ways of doing it - if we proceed, I will outline those in a separate
>>>> email to solicit your input on the most desirable approach to use.
>>>>
>>>> 2. binding support. One can achieve significant performance
>>>> improvements by binding processes to specific cores, sockets, and/or
>>>> NUMA regions (regardless of using MPI or not, but certainly important
>>>> for MPI applications). This requires not only the binding code, but
>>>> some logic to ensure that one doesn't "overload" specific resources.
>>>>
>>>> 3. process mapping. I haven't verified it yet, but I suspect that
>>>> Hadoop provides each executing instance with an identifier that is
>>>> unique within that job - e.g., we typically assign an integer "rank"
>>>> that ranges from 0 to the number of instances being executed.
>>>> This identifier is critical for MPI applications, and the relative
>>>> placement of processes within a job often dictates overall
>>>> performance. Thus, we would provide a mapping capability that allows
>>>> users to specify patterns of process placement for their job - e.g.,
>>>> "place one process on each socket on every node".
>>>>
>>>> I have written the code to implement the above support on a number of
>>>> systems, and don't foresee major problems doing it for Hadoop (though
>>>> I would welcome a chance to get a brief walk-through of the code from
>>>> someone). Please let me know if this would be of interest to the
>>>> Hadoop community.
>>>>
>>>> Thanks
>>>> Ralph Castain
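[Editor's note] For readers following the thread, the wireup step Milind describes - each MPI process reads the submitting client's IP and port from environment variables, connects to it, and a barrier-style exchange gives every rank in MPI_COMM_WORLD everyone's endpoints - might be sketched roughly as below. This is an illustrative sketch only, not the actual patch: the environment-variable names (`OMPI_YARN_CLIENT_HOST`/`OMPI_YARN_CLIENT_PORT`), the line-oriented wire format, and the class/method names are all assumptions introduced here for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Sketch of the collective "modex" wireup described in the thread: each
// process connects back to the submitting client, reports its own
// endpoint, and blocks until the client has heard from every rank; the
// client then sends back the full endpoint list, so each member of
// MPI_COMM_WORLD learns how to reach every other member.
public class WireupSketch {

    // Connect to the submitting client, send our endpoint, and receive
    // the endpoints of all ranks once the barrier completes.
    // Assumed wire format: one endpoint per line, preceded by a count.
    public static List<String> exchangeEndpoints(String clientHost, int clientPort,
                                                 String myEndpoint) throws IOException {
        try (Socket s = new Socket(clientHost, clientPort);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println(myEndpoint);                 // report our own endpoint
            int n = Integer.parseInt(in.readLine()); // reply arrives only after all ranks report
            List<String> endpoints = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                endpoints.add(in.readLine());        // one line per rank's endpoint
            }
            return endpoints;
        }
    }

    public static void main(String[] args) throws IOException {
        // In a YARN container the client's address would arrive via
        // environment variables; these variable names are hypothetical.
        String host = System.getenv("OMPI_YARN_CLIENT_HOST");
        int port = Integer.parseInt(System.getenv("OMPI_YARN_CLIENT_PORT"));
        String me = args.length > 0 ? args[0] : "localhost:0";
        System.out.println(exchangeEndpoints(host, port, me));
    }
}
```

The blocking read of the endpoint count is what makes this a barrier: no rank proceeds until every rank has reported in, which matches the observed behavior that the barrier's result is everyone's endpoints.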