Here is my GSoC proposal.
Thanks,
--
Okoye D.C.
Project title: Mon Ami OSCAR
Benefits to OSCAR:
OSCAR is a popular cluster manager in HPC environments, which is no surprise
given its configurability and the range of Linux distributions it supports.
However, installing a cluster resource manager is only the first step in
managing a cluster; measures are also needed to make the individual nodes
resilient to failures. Incorporating a framework that monitors the services
running on the various nodes and, when a service fails, attempts to restart
it and otherwise fails over to a compute node designated as a standby will
make OSCAR more robust.
Project Synopsis:
Currently OSCAR can install a cluster, perform management tasks such as
adding and deleting nodes, and monitor the status of the cluster with Ganglia
or Nagios. HA-OSCAR, an extension of OSCAR, introduces redundancy at the
head-node level by duplicating the primary head node and, based on predefined
policies, carrying out specific actions to guarantee the availability of this
head node. However, OSCAR cannot monitor the state of services running
concurrently on all compute nodes, such as lam and pbs_mom, or take predefined
actions when they fail. I propose integrating the universal monitor monami,
which would reside on the compute nodes and report the status of selected
essential services to a global monitor, Nagios. Nagios would handle failover
and the re-integration of a previously failed node into the cluster.
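To make the reporting step concrete, here is a minimal sketch of how a
per-node result could be summarized for the global monitor. It follows the
standard Nagios plugin convention of a return code (0 OK, 1 WARNING,
2 CRITICAL, 3 UNKNOWN) plus one line of text. The function name
summarize_node and the shape of its inputs are assumptions for illustration
only; how monami would actually collect and deliver these states is still to
be designed.

```python
from typing import Dict, Iterable, Tuple

# Return codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def summarize_node(
    node: str, running: Dict[str, bool], required: Iterable[str]
) -> Tuple[int, str]:
    """Map per-service states on one node to a (code, message) pair.

    `running` maps a service name to True if it was seen running.
    `required` lists the services that must be up on this node.
    Both are hypothetical inputs for this sketch.
    """
    required = list(required)
    if not required:
        return UNKNOWN, f"UNKNOWN - {node}: no services configured"

    # A service absent from `running` is treated as not running.
    missing = sorted(name for name in required if not running.get(name, False))
    if missing:
        return CRITICAL, f"CRITICAL - {node}: not running: {', '.join(missing)}"
    return OK, f"OK - {node}: {len(required)} services running"
```

For example, summarize_node("node1", {"lam": True}, ["lam", "pbs_mom"])
reports a CRITICAL state that names pbs_mom as the missing service.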
Deliverables:
1. OSCAR package with compute-node resiliency integration.
2. Documentation.
Project Details:
I will create a mechanism similar to the one present in HA-OSCAR that will
allow services on each compute node to be monitored locally. This can be
implemented with monami, with the results reported to a global monitor resident
on the head node. The global monitor will first attempt to restart a failed
service; if the restart fails, it will "smartly" remove the node from the
cluster. Once the underlying problem is resolved, monami will notify the global
monitor that the node is available for work again, and the node will be
re-integrated into the cluster. The bulk of the work lies in implementing a
smart global monitor that can manage the cluster and conform to user-specified
policies, such as what should be done when a service cannot be restarted. For
this I can either use the already provided Nagios or an alternative offering
more flexibility.
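The restart, removal and re-integration flow described above can be sketched
as a small policy object. This is only an illustration of the intended
behaviour, not a Nagios or monami API: the class GlobalMonitor, its method
names, the max_restarts option and the injected restart callable are all
hypothetical, and the real removal step would go through whatever OSCAR and
the resource manager provide.

```python
from typing import Callable, Set


class GlobalMonitor:
    """Hypothetical head-node policy: restart first, then remove the node."""

    def __init__(self, restart: Callable[[str, str], bool], max_restarts: int = 1):
        # `restart(node, service)` returns True if the service came back up.
        self.restart = restart
        self.max_restarts = max_restarts  # user-specified policy option
        self.active: Set[str] = set()     # nodes available for work
        self.removed: Set[str] = set()    # nodes taken out after a failure

    def add_node(self, node: str) -> None:
        self.removed.discard(node)
        self.active.add(node)

    def report_failure(self, node: str, service: str) -> str:
        """Called when a compute node reports a failed service."""
        for _ in range(self.max_restarts):
            if self.restart(node, service):
                return "restarted"
        # All restart attempts failed: take the node out of the cluster.
        self.active.discard(node)
        self.removed.add(node)
        return "removed"

    def report_recovery(self, node: str) -> str:
        """Called when a previously failed node reports it is healthy again."""
        if node in self.removed:
            self.add_node(node)
            return "reintegrated"
        return "ignored"
```

Keeping the restart action as an injected callable leaves the policy
independent of the monitoring back end, so the same logic could sit behind
Nagios or behind an alternative global monitor.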
Project Schedule:
Initially I will spend time with my mentor defining exactly which policies and
options the user should be able to specify for handling faulty nodes. I will
also need to discuss which global monitoring mechanism is most suitable, with
particular emphasis on the number of Linux distributions supported. Installing
monami will require architecture-specific RPMs on Red Hat based systems and
sources on Debian based systems. After deciding on the best packaging approach,
I will need help repackaging for OSCAR. Finally, I will use the OSCAR package
manager to handle installation of the required components. An advantage of this
is that all dependencies required by these components will be resolved by the
package manager. The basic design will be modular, with each component being a
distinct OSCAR package. This will keep maintenance simple and make
installation, setup, configuration, and running rely on the OSCAR API, so
future updates to OSCAR should not affect any of these components.
Project Timeline:
Week1:
Clearly define the scope of the project, obtain any administrative information
needed, define a regular schedule for collaboration with my mentor, and
conclude design plans.
Week2-3:
Create the OSCAR packages required to handle building, configuration, and
installation of monami and Nagios.
Week4:
Determine the steps required to add a node to and remove a node from the list
of available nodes.
Week5-7:
Modify Nagios to implement intelligent node failover, removal, and addition.
Week8-10:
Extensive testing and code clean up.
Personal Information:
I am a senior in Computer Science at Louisiana Tech University. I was
introduced to OSCAR through introductory work on HA-OSCAR. My work on
HA-OSCAR involved eliminating its core dependencies so that HA-OSCAR would be
compatible with any version of OSCAR without recoding. This meant I had to
study the OSCAR installation procedure extensively, with particular emphasis on
the System Installation Suite. I am a highly motivated individual with a great
capacity for work, which is how I was able to remove the core dependencies of
HA-OSCAR in under 3 weeks of continuous work. I plan to graduate in the Fall of
2009 and enter graduate school the following winter. I also frequently find
myself in leadership positions.
Finally, I eagerly look forward to working with the OSCAR developer team.
_______________________________________________
Oscar-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-devel