Hi folks

I have been familiarizing myself with the Hadoop 0.23 code tree, and as I worked 
my way through the code I found myself wondering whether people were aware of the 
tools commonly used by the HPC community. Just in case the community isn't, I 
thought it might be worth giving a very brief summary of the field.

The HPC community has a long history of implementing and fielding resource and 
application management tools that are fault tolerant with respect to process 
and node failures. These systems readily scale to the 30K node and 100K process 
size today, and are running on tens of thousands of clusters around the world. 
Efforts to extend capabilities to the 100K node, 1M process level are being 
aggressively pursued and are expected to be fielded within the next two years.

Although these systems support MPI applications, there is nothing exclusively 
MPI about them. All will readily execute any application, including a MapReduce 
operation. HDFS support is lacking, of course, but can be readily added in most 
cases.

My point here is that replicating all the fault tolerant resource and 
application management capabilities of the existing systems is a major 
undertaking that may not be necessary. The existing systems represent hundreds 
of man-years of effort, coupled with thousands of machine-years of operational 
experience - together, these provide a level of confidence that will be hard to 
duplicate.

In the HPC world, resource managers (RMs) and application managers (AMs) are 
typically separate entities. Although many people use the AM that comes with a 
particular RM, almost all RM/AM systems support a wide range of combinations so 
that users can pick and choose the pairing that best fits their needs. Both 
proprietary (e.g., Moab and LSF) and open-source (e.g., SLURM and Grid Engine) 
RMs and AMs are available, each with differing levels of fault tolerance.

For those not wanting to deal with the peculiarities of each combination, 
abstraction layers are available. For example, Open MPI's "mpirun" tool will 
transparently interface to nearly all available RMs and AMs, executing your 
application in the same manner regardless of the underlying systems. OMPI 
provides its own fault tolerance capabilities to ensure commonality across 
environments, and is capable of restarting and rewiring applications as 
required. In addition, OMPI allows the definition of "fault groups" (i.e., 
failure dependencies between nodes) to further protect against cascading 
failures, and provides a variety of software and hardware sensors to detect 
deteriorating behavior prior to actual node/process failure.
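As a rough sketch of what that abstraction looks like in practice (the 
"wordcount" binary and hosts.txt file are placeholders for illustration; "-np" 
and "--hostfile" are standard mpirun options):

```shell
# Launch the same binary the same way whether the cluster is running
# SLURM, Torque, or no resource manager at all -- mpirun detects the
# underlying environment itself and obtains the allocation accordingly.
mpirun -np 8 --hostfile hosts.txt ./wordcount input/ output/
```

The point is that the user-facing command never changes; only the underlying 
RM/AM pairing does.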

Multiple error response strategies are available from most existing systems, 
including OMPI. These are user-selectable at time of execution and range from 
simple termination of the entire application, to restarting processes on the 
current node (assuming that node is up or restarts), shifting processes to 
other nodes already in use by the application, and shifting processes to 
available "backup" nodes.

I can provide more info as requested, but wanted to at least make you aware of 
the existing capabilities. A quick "poll" of HPC RM/AM providers indicates a 
willingness to add HDFS support; they haven't done so to date simply because 
nobody has asked, and they tend to respond to user demand. No technical 
barrier is immediately apparent.

HTH
Ralph