Here is my GSoC proposal.

Thanks,

-- 
Okoye D.C.
Project title: “Mon Ami OSCAR”

Benefits to OSCAR:
OSCAR is a popular cluster manager in HPC environments, which is unsurprising 
given its configurability and the range of Linux distributions it supports. 
However, installing a cluster resource manager is only the first step in 
managing a cluster; measures are also needed to make individual nodes 
resilient to failures. A more robust OSCAR would incorporate a framework that 
monitors the services running on each node, attempts to restart a service when 
it fails, and, if the restart does not succeed, fails over to a compute node 
designated as a standby.

Project Synopsis:
Currently OSCAR can install a cluster, perform management tasks such as 
adding and deleting nodes, and monitor the status of the cluster with Ganglia 
or Nagios. HA-OSCAR, an extension of OSCAR, introduces redundancy at the 
head-node level by duplicating the primary head node and, based on predefined 
policies, carrying out specific actions to guarantee its availability. 
However, OSCAR cannot monitor the state of services running concurrently on 
all compute nodes, such as lam and pbs_mom, or take predefined actions when 
they fail. I propose integrating the universal monitor monami, which would 
reside on the compute nodes and report the status of essential services to a 
global monitor, Nagios. Nagios would handle failover and the re-integration of 
a previously failed node into the cluster.

Deliverables:
1. OSCAR package with compute-node resiliency integration.
2. Documentation.

Project Details:
I will create a mechanism, similar to the one in HA-OSCAR, that allows 
services on each compute node to be monitored locally. This can be implemented 
with monami, with the results reported to a global monitor resident on the 
head node. The global monitor will first attempt to restart a failed service; 
failing that, it will "smartly" remove the node from the cluster. Once the 
underlying problem is resolved, monami will notify the global monitor that the 
node is available for work, and the node will be re-integrated into the 
cluster. The bulk of the work lies in implementing a smart global monitor that 
can manage the cluster and conform to user-specified policies, such as what to 
do when a service cannot be restarted. For this I can either use the already 
provided Nagios or an alternative offering more flexibility.
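
As a rough illustration of the restart, remove, and re-integrate policy 
described above, here is a minimal Python sketch. It is not a committed 
design: the class GlobalMonitor, its on_report method, and the three callbacks 
(restart_service, remove_node, add_node) are hypothetical names chosen for 
this example, and none of them correspond to an existing monami, Nagios, or 
OSCAR interface. The max_restarts setting stands in for one user-specified 
policy option.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class NodeState(Enum):
    ACTIVE = "active"    # node is part of the cluster
    REMOVED = "removed"  # node was taken out after restarts failed


@dataclass
class GlobalMonitor:
    """Hypothetical head-node policy engine (illustration only)."""

    # Hypothetical callbacks; a real system would wrap the actual
    # service-restart and node add/remove tooling here.
    restart_service: Callable[[str, str], bool]
    remove_node: Callable[[str], None]
    add_node: Callable[[str], None]
    max_restarts: int = 1  # assumed user-specified policy option
    states: Dict[str, NodeState] = field(default_factory=dict)

    def on_report(self, node: str, service: str, healthy: bool) -> NodeState:
        """Handle one status report sent by a compute node's local monitor."""
        state = self.states.setdefault(node, NodeState.ACTIVE)

        if healthy:
            # Simplification: a healthy report from a removed node is
            # treated as "the node is available for work again".
            if state is NodeState.REMOVED:
                self.add_node(node)
                self.states[node] = NodeState.ACTIVE
            return self.states[node]

        if state is NodeState.ACTIVE:
            # First try to restart the failed service.
            for _ in range(self.max_restarts):
                if self.restart_service(node, service):
                    return NodeState.ACTIVE
            # Restart failed: take the node out of the cluster.
            self.remove_node(node)
            self.states[node] = NodeState.REMOVED
        return self.states[node]
```

In this sketch a failed report for pbs_mom on an active node triggers up to 
max_restarts restart attempts; only if all of them fail is the node removed, 
and a later healthy report brings it back. A real policy would likely track 
health per service rather than per node.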

Project Schedule:
Initially I will spend time with my mentor defining exactly which policies and 
options the user should be able to specify for handling faulty nodes. I will 
also discuss which global monitoring mechanism is most suitable, with 
particular emphasis on the number of Linux distributions supported. Installing 
monami will need architecture-specific RPMs on Red Hat based systems and 
sources on Debian based systems. After deciding on the best packaging 
approach, I will need help repackaging for OSCAR. Finally, I will use the 
OSCAR package manager to handle installation of the required components. An 
advantage of this is that all dependencies of these components will be 
resolved at install time. The basic design will be modular, with each 
component being a distinct OSCAR package. This will keep maintenance simple 
and make installation, setup, configuration, and running rely on the OSCAR 
API, so future updates to OSCAR should not affect any of these components.

Project Timeline:
Week1:
Clearly define the scope of the project, obtain any administrative information 
needed, define a regular schedule for collaboration with my mentor, and 
conclude design plans.

Week2-3:
Create the OSCAR packages required to handle building, configuration, and 
installation of monami and Nagios.

Week4:
Determine the steps required to add a node to, and remove a node from, the 
list of available nodes.

Week5-7:
Modify Nagios to implement intelligent node failover, removals, and additions.

Week8-10:
Extensive testing and code clean up.

Personal Information:

I am a senior in Computer Science at Louisiana Tech University. I was 
introduced to OSCAR through introductory work on HA-OSCAR, which involved 
eliminating its core dependencies so that HA-OSCAR would be compatible with 
any version of OSCAR without recoding. This meant I had to study the OSCAR 
installation procedure extensively, with particular emphasis on the System 
Installation Suite. I am highly motivated and have a strong capacity for 
sustained work, which is how I was able to remove the core dependencies of 
HA-OSCAR in under 3 weeks of continuous work. I plan to graduate in the fall 
of 2009 and enter graduate school the following winter. I have also 
frequently held leadership positions.
Finally, I look forward to working with the OSCAR developer team.