On Nov 21, 2011, at 5:04 PM, <milind.bhandar...@emc.com> wrote:

> Ralph,
>
> Yes, I have completed the first step, although I would really like that
> code to be part of the MPI Application Master (Chris Douglas suggested a
> way to do this at ApacheCon).
>
> Regarding the remaining steps, I have been following discussions on the
> open mpi mailing lists, and reading code for hwloc.
>
> If you are making a trip to Cisco HQ sometime soon, I would like to have
> a face-to-face about hwloc.
Not sure that looks likely right now - my project at Cisco is done, and it
appears I'll be leaving the company soon.

> I have so far avoided using a native task controller for spawning MPI
> jobs, but given the lack of support for binding in Java, it looks like I
> will have to bite the bullet.

I was actually looking at porting the binding support to Java, as it looks
feasible to do so, and I can understand not wanting to absorb all that
configuration code to handle it in C.

Given the loss of my job, I have some free time on my hands while I search
for employment, so I thought I might spend it looking at the Hadoop
integration - since you have completed the wireup, I might look at this
next.

>
> - milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author,
> and do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 11/21/11 3:54 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>
>> Hi Milind
>>
>> Glad to hear of the progress - I recall our earlier conversation. I
>> gather you have completed step 1 (wireup) - have you given any thought
>> to the other two steps? Anything I can do to help?
>>
>> Ralph
>>
>>
>> On Nov 21, 2011, at 4:47 PM, <milind.bhandar...@emc.com> wrote:
>>
>>> Hi Ralph,
>>>
>>> I spoke with Jeff Squyres at SC11, and updated him on the status of my
>>> OpenMPI port on Hadoop Yarn.
>>>
>>> To update everyone, I have OpenMPI examples running on #Yarn, although
>>> it requires some code cleanup and refactoring; that can be done as a
>>> later step.
>>>
>>> Currently, the MPI processes come up, get the submitting client's IP
>>> and port via environment variables, connect to it, and do a barrier.
>>> The result of this barrier is that everyone in MPI_COMM_WORLD gets
>>> each other's endpoints.
>>>
>>> I am aiming to submit the patch to hadoop by the end of this month.
>>>
>>> I will publish the openmpi patch to github.
>>>
>>> (As I mentioned to Jeff, OpenMPI requires a CCLA for accepting
>>> submissions. That will take some time.)
>>>
>>> - Milind
>>>
>>> ---
>>> Milind Bhandarkar
>>> Greenplum Labs, EMC
>>> (Disclaimer: Opinions expressed in this email are those of the author,
>>> and do not necessarily represent the views of any organization, past
>>> or present, the author might be affiliated with.)
>>>
>>>
>>>
>>>> I'm willing to do the integration work, but wanted to check first to
>>>> see if (a) someone in the Hadoop community is already doing so, and
>>>> (b) if you would be interested in seeing such a capability and
>>>> willing to accept the code contribution?
>>>>
>>>> Establishing MPI support requires the following steps:
>>>>
>>>> 1. wireup support. MPI processes need to exchange endpoint info
>>>> (e.g., for TCP connections, IP address and port) so that each process
>>>> knows how to connect to any other process in the application. This is
>>>> typically done in a collective "modex" operation. There are several
>>>> ways of doing it - if we proceed, I will outline those in a separate
>>>> email to solicit your input on the most desirable approach to use.
>>>>
>>>> 2. binding support. One can achieve significant performance
>>>> improvements by binding processes to specific cores, sockets, and/or
>>>> NUMA regions (regardless of using MPI or not, but certainly important
>>>> for MPI applications). This requires not only the binding code, but
>>>> some logic to ensure that one doesn't "overload" specific resources.
>>>>
>>>> 3. process mapping. I haven't verified it yet, but I suspect that
>>>> Hadoop provides each executing instance with an identifier that is
>>>> unique within that job - e.g., we typically assign an integer "rank"
>>>> that ranges from 0 to the number of instances being executed.
>>>> This identifier is critical for MPI applications, and the relative
>>>> placement of processes within a job often dictates overall
>>>> performance. Thus, we would provide a mapping capability that allows
>>>> users to specify patterns of process placement for their job - e.g.,
>>>> "place one process on each socket on every node".
>>>>
>>>> I have written the code to implement the above support on a number of
>>>> systems, and don't foresee major problems doing it for Hadoop (though
>>>> I would welcome a chance to get a brief walk-through of the code from
>>>> someone). Please let me know if this would be of interest to the
>>>> Hadoop community.
>>>>
>>>> Thanks
>>>> Ralph Castain
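[Editor's note] For readers following the thread, the wireup step Milind describes - each MPI process reads the submitting client's IP and port from environment variables, connects to it, and a barrier-style exchange gives every rank in MPI_COMM_WORLD everyone's endpoints - might be sketched roughly as below. This is an illustrative sketch only, not the actual patch: the environment-variable names (`OMPI_YARN_CLIENT_HOST`/`OMPI_YARN_CLIENT_PORT`), the line-oriented wire format, and the class/method names are all assumptions introduced here for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Sketch of the collective "modex" wireup described in the thread: each
// process connects back to the submitting client, reports its own
// endpoint, and blocks until the client has heard from every rank; the
// client then sends back the full endpoint list, so each member of
// MPI_COMM_WORLD learns how to reach every other member.
public class WireupSketch {

    // Connect to the submitting client, send our endpoint, and receive
    // the endpoints of all ranks once the barrier completes.
    // Assumed wire format: one endpoint per line, preceded by a count.
    public static List<String> exchangeEndpoints(String clientHost, int clientPort,
                                                 String myEndpoint) throws IOException {
        try (Socket s = new Socket(clientHost, clientPort);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println(myEndpoint);                 // report our own endpoint
            int n = Integer.parseInt(in.readLine()); // reply arrives only after all ranks report
            List<String> endpoints = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                endpoints.add(in.readLine());        // one line per rank's endpoint
            }
            return endpoints;
        }
    }

    public static void main(String[] args) throws IOException {
        // In a YARN container the client's address would arrive via
        // environment variables; these variable names are hypothetical.
        String host = System.getenv("OMPI_YARN_CLIENT_HOST");
        int port = Integer.parseInt(System.getenv("OMPI_YARN_CLIENT_PORT"));
        String me = args.length > 0 ? args[0] : "localhost:0";
        System.out.println(exchangeEndpoints(host, port, me));
    }
}
```

The blocking read of the endpoint count is what makes this a barrier: no rank proceeds until every rank has reported in, which matches the observed behavior that the barrier's result is everyone's endpoints.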