Hey guys ( and gals ? )~~
First -- an introduction for those of you I have not met or worked with
yet. I am Nicholas Henke from Univ. of Pennsylvania, and I develop a
cluster suite based on bproc/Maui/supermon called Clubmask (
clubmask.sf.net, www.liniac.upenn.edu ).
Now for the good stuff... Clubmask is in its present form a complete
cluster installer and runtime environment -- like OSCAR -- from hardware
to running jobs. We are looking to reduce the amount of code we have to
maintain in Clubmask, and cut it down to just the job and resource
management part. As we already use SIS for installing nodes, the shift to
creating Clubmask as an OSCAR package should be fairly minimal.
What I am asking the OSCAR folks is whether that will ever happen ( don't
say yes yet... it gets tricky ) ? In order to run Clubmask, you need to have:
1) a bproc patched kernel ( SRPMS provided by us )
2) the _latest_ version of the Maui scheduler, 3.2.5 or 3.2.6alpha.
3) LAM MPI support should be ready to go in the near future, but MPICH
is completely unsupported
4) We also do not use ganglia ( _hell_ on switches and bad for
performance ), but do provide a supermon2ganglia translator that allows
you to use the ganglia web frontend with the supermon backend.
Now, we have started to change things in Clubmask to move towards a more
'Clubmask-$version.rpm' installation of Clubmask, but before I put any
great effort into getting it working in OSCAR, I was hoping for some
feedback on the feasibility of it. We currently run and support Clubmask
on 8 clusters, and are extremely happy with it. Below are some pros and
cons that we see --
PROS:
1) NOT OPENPBS ( and the crowds rejoice ) -- that means a state of the
art and maintained cluster job manager :)
2) Uses Bproc -- much easier to control jobs and processes.
3) Designed to be 'admin friendly' -- there are a ton of features in
Clubmask that make it nice for admins ( as that is what we are here at
UPenn ). For example: there is the ability to set 'triggers' in the
database that can fire off data to a host and port -- we use this for
node crash notification, high node swap use, etc. Basically -- we tell
Clubmask to trigger on any node where 'swap_free < 100', and when that
threshold is crossed, it sends us a message that we can redirect to a
file, a pager, etc. Also -- the database can be scripted and interacted
with just like any other database -- the data for the entire cluster can
be seen.
4) Coded in Python -- extremely easy to debug and extend. There are some
C Python extensions for speed and scalability.
5) Completely open source -- GPL'd.
6) Allows for either disk-full or disk-less clusters.
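A rough sketch of the trigger idea from item 3, in Python. This is not Clubmask's actual code or API: the function names, the layout of the per-node samples, and the use of UDP for the notification are all assumptions made up for illustration. Only the 'swap_free < 100' expression and the "send data to a host and port" behaviour come from the description above.

```python
import operator
import socket

# Hypothetical trigger format: "<metric> <op> <threshold>", e.g. "swap_free < 100".
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}


def parse_trigger(expr):
    """Split 'swap_free < 100' into (metric, comparison function, threshold)."""
    metric, op, threshold = expr.split()
    return metric, OPS[op], float(threshold)


def fired_nodes(expr, samples):
    """Return the sorted names of nodes whose sample satisfies the trigger.

    `samples` maps node name -> dict of metric values (made-up layout).
    """
    metric, compare, threshold = parse_trigger(expr)
    return sorted(node for node, values in samples.items()
                  if metric in values and compare(values[metric], threshold))


def notify(host, port, expr, nodes):
    """Send one line per triggered node to host:port (UDP is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for node in nodes:
            sock.sendto(f"{node}: {expr}\n".encode(), (host, port))
```

With something like this, a listener on the receiving host and port can append the lines to a file or hand them to a pager gateway, which is the kind of redirection described in item 3.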
CONS:
1) Need a new kernel. This means that the user needs to reboot the
machine after installing it, as bproc support needs to be loaded for
Clubmask.
2) Not PBS -- well ok, people are used to using PBS ( don't they call
that learned helplessness ? ). We try to provide a similar set of
features to PBS, but some things just won't be the same.
3) Currently not fully Single System Image ( disk-less isn't fully worked
out )
4) Not fully mature and stable -- but we are working on it.
5) Security -- we are working on securing Clubmask, but are
concentrating more now on actual functionality.
OK -- well that is the deal so far -- there is a good amount of work to be
done to get this shoe-horned into OSCAR, and I want to make sure it will
be worth the effort.
Thanks for your time!
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
_______________________________________________
Oscar-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-devel