Re: [Beowulf] Project Planning: Storage, Network, and Redundancy Considerations

Brian R. Smith Mon, 19 Mar 2007 12:45:27 -0800

Mike,

Mike Davis wrote:

Brian,
My personal preference is to keep scratch on the compute nodes whenever possible. This reduces network traffic dramatically. The method that we use for this is to write wrappers for our more common applications (VASP, G03, DeMon) setting scratch to /tmp on the nodes. For interactive work such as Abaqus the env variables are set to /tmp as well.
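A minimal sketch of the kind of wrapper described above, written in Python as an illustration only (the actual wrappers are not shown in this thread). The function names and the default variable name TMPDIR are assumptions; each application reads its own scratch variable, which would have to be supplied in var_names.

```python
import os
import shutil
import subprocess
import tempfile


def build_scratch_env(base_env, scratch_root="/tmp", var_names=("TMPDIR",)):
    """Create a private scratch directory under scratch_root and return
    (env, job_dir), where env is a copy of base_env with every variable in
    var_names pointing at that directory."""
    job_dir = tempfile.mkdtemp(prefix="scratch-", dir=scratch_root)
    env = dict(base_env)
    for name in var_names:
        env[name] = job_dir
    return env, job_dir


def run_with_local_scratch(command, scratch_root="/tmp", var_names=("TMPDIR",)):
    """Run command (a list of arguments) with node-local scratch, then
    remove the scratch directory. Returns the command's exit code."""
    env, job_dir = build_scratch_env(os.environ, scratch_root, var_names)
    try:
        return subprocess.call(command, env=env)
    finally:
        # Clean up so /tmp on the node does not fill with stale job data.
        shutil.rmtree(job_dir, ignore_errors=True)
```

A per-application wrapper would then call run_with_local_scratch with the real binary and its arguments, passing whichever scratch variable that application honors.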

I usually provide both.  Some parallel applications that we run write
simultaneously to one file or another and others don't (maybe each
process writes out multiple files or simply hits the file system really
hard).  We have PVFS2 for parallel scratch and we provide /slocal as a
local disk alternative (apps like Schrodinger's Jaguar & Qsite run very
well under a local scratch configuration).  The solution I'm looking for
will (hopefully) replace our need for an independent parallel scratch
and /home storage. It would be good if we could run both off of one
filesystem (you know, administrative overhead and stuff).

We do not make use of Myrinet or InfiniBand at this time and we have 120 current nodes on one HP modular switch (combo of dual single and dual dual Opterons). I have another 70 dual duals and one 8-way dual to add this year.
One thing that you might consider for performance is a completely separate NFS network: connect all nodes to a switch with a 10Gb GBIC that connects to the Thumper or other storage device.

This is a possibility...  It would be easy to roll out and would provide
great benefit for file system use (that and all of our nodes now have
dual GigE links).

We use ZFS for some purposes (/home on a number of research machines sharing a 5.6TB storage unit with a v40z frontend and NFS) but until now the clusters still use ext3 (yes, I know that there are problems).

I was looking into ZFS, but the Linux "client" appears to run via FUSE.
I've heard good things but I've also heard a couple not-so-good things
about the performance of userspace file system access.  Your thoughts?

I've thought that Mississippi State's design with individual rack gigabit switches connected by 10Gb GBICs was interesting. But we are getting good performance with all of the nodes connected at 1Gb to a single switch and I think that it is easier to manage.

I agree with this point about the single switch.  If it's possible for us
to do this, I would prefer to go that route.  Otherwise, we have a
couple other plans up our sleeves.
Thanks for your suggestions!

Brian

Mike Davis


Brian R. Smith wrote:
Hey list,
I am seeking some advice regarding our latest project. Currently, our shop runs 5 different clusters of varying size and handles the maintenance and administration of each. I've been planning for some time to finally consolidate all of these machines together under a single head-node with a common storage pool (for /home, /opt, /usr/local), utilizing SGE for our resource management. A lot of times on this list, the point comes up that many things depend upon your applications so I'll make it clear here: Our "application" is quite varied. Our users come from a wide variety of disciplines and the nature of our group is as a sort of tier-2 scientific computing lab where we provide hardware, development environments, and support for developing and running applications of various nature, hence general-purpose.
We have fairly robust systems in place for node provisioning (an in-house system that utilizes kickstart and anaconda that supports multiple architectures) and resource management (SGE has proven extremely reliable and more than capable of managing our fairly quaint resources).
Currently, my two largest problems are figuring out our storage needs (in terms of device bandwidth and throughput) and our network needs. When all is said and done, this is the hardware I expect to have:
~60x 16GiB RAM, Dual-Dual-Core AMD Opterons, IB-connected, GigE-connected, with modest local storage
8x 16GiB RAM, Dual-Dual-Core AMD Opterons & 24x 8GiB RAM, Dual-Opterons, Myrinet-connected, GigE-connected, modest storage (cluster1)
We wish to add to this cluster the following existing configurations:
12x AMD Opteron 246, 4GiB RAM, Myrinet-connected, etc. (cluster2)
38x AMD Opteron 246, 4GiB RAM, GigE-connected (clusters 3 & 4)
~40x Intel P4 Xeon @2.66GHz, 2GiB RAM, GigE-connected (cluster5)
Yes, I know the last sets of machines are approaching (or already are) legacy status (especially the last batch), but these machines are still useful at running the problems they were originally purchased for (especially the Opterons), and are still very good at some other general tasks (Distributed Matlab, commercial FE codes, instructional use, etc).
Currently, each cluster has its own local storage, averaging about 1TB on each. We've currently got about 4TB of total data across all of these machines but anticipate this number possibly doubling within the next 12-18 months. The first phase of this plan (which must occur in concert with the second) is to consolidate all of these disparate arrays into one volume that is accessible by every node in the cluster. I know that some of the supercomputing centers like NCSA have dealt with much larger-scale storage issues than this so I'd love to hear from one of you. The current ideas that we have been floating around include the following:
1. Proprietary parallel storage systems (like Panasas, etc.): It provides the per-node bandwidth, aggregate bandwidth, caching mechanisms, fault-tolerance, and redundancy that we require (plus having a vendor offering 24x7x365 support & 24-hour turnaround is quite a breath of fresh air for us). Price point is a little high for the amount of storage that we will get though, little more than doubling our current overall capacity. As far as I can tell, I can use this device as a permanent data store (like /home) and also as the user's scratch space so that there is only a single point for all data needs across the cluster. It does, however, require the installation of vendor kernel modules which do often add overhead to system administration (as they need to be compiled, linked, and tested before every kernel update).
2. Separate /home and /scratch volumes. /home would be NFS exported read-only to all hosts (to prevent writes during run-time). The volume would reside on one or two file servers (Sun's Thumper/X4500, etc., either on JFS or GFS (or perhaps ZFS???), depending on hardware) and at current prices, we would be able to acquire around 20TB. We would double this purchase and provide the same setup off-site for redundancy (including our tape-backup regime). Bandwidth for reads is more than sufficient for the needs of our current users. The scratch space would be composed of 8-12 nodes with 0.5 TB RAID1 storage utilizing either PVFS2 (which has worked exceptionally well for us previously) or Lustre (which we have not tested very well yet). Both require separate kernel modules (this seems to be a recurring theme) and hence some additional administration. Neither is well-suited for general tasks such as compiling (though there are ways around this) or problems involving many short writes, but most of the applications being run do not fit this profile. 8-12 nodes should provide us between 3 and 6TB of usable scratch. We would like a little more, but again, this is sufficient for our current usage patterns. The pricing for this might be somewhat less than the proprietary system described above.
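The scratch sizing in option 2 can be checked with a few lines of Python. This is only a rough sketch: it assumes the parallel file system stripes data across the storage nodes with no redundancy of its own, and it is unclear from the post whether 0.5 TB is the raw or the post-mirroring capacity of each node, so both readings are shown. The function name is made up for the illustration.

```python
def usable_scratch_tb(nodes, per_node_tb=0.5, figure_is_raw=False):
    """Aggregate scratch capacity in TB across the storage nodes.

    If figure_is_raw is True, per_node_tb is treated as raw disk space and
    RAID1 mirroring halves it; otherwise it is treated as already usable.
    """
    per_node_usable = per_node_tb / 2 if figure_is_raw else per_node_tb
    return nodes * per_node_usable


for nodes in (8, 12):
    print(nodes, "nodes:",
          usable_scratch_tb(nodes), "TB if 0.5 TB is usable per node,",
          usable_scratch_tb(nodes, figure_is_raw=True), "TB if 0.5 TB is raw")
```

Under these assumptions the two readings span 2 to 6 TB, and file system metadata overhead would reduce each figure somewhat further.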
Can anyone suggest any other approaches to this problem?
We also have a problem regarding how to link these clusters together over a single network fabric (GigE). It will be possible for all nodes to utilize this network for Message Passing, but it is highly improbable that such a scenario will ever be played out since almost all of our MPI jobs will no doubt run on either the InfiniBand or Myrinet nodes (there are SGE policies in place to help ensure this). Currently, each cluster has its own GigE network for provisioning, administration, and resource management. Some of these hosts utilize it for communications (clusters 3, 4, & 5) and all of them will no doubt need to utilize it for filesystem access. Clusters 3 and 4 can be consolidated to a single GigE HP switch that will have a couple of ports left over. Cluster 5 will have to be kept as-is and clusters 1 and 2 will fit on a single switch as well. I have discussed with our campus network admin the possibility of using two recent Cisco switches that would support failover and load balancing as a redundant and high-bandwidth "trunk" for each of these networks, obviously with the capacity to grow in the future. Each of our existing 3 switches would have up to two links to each "trunk" switch and our file servers (in whichever configuration we eventually choose) would also be attached to these switches. There should be enough bandwidth to go around under this plan. I'm just curious if this seems doable and if it is, are there any obvious pitfalls that I have overlooked? Is there perhaps a better way to approach this (perhaps a single, large switch instead)?
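One way to sanity-check the "enough bandwidth to go around" claim is to estimate the oversubscription ratio of each edge switch. The sketch below is a back-of-the-envelope Python calculation, not a measurement. The node counts come from the hardware list above; the uplink speed is an assumption (it is not stated in the post), as is the premise that all four uplinks (two to each trunk switch) carry traffic at the same time.

```python
def oversubscription(node_count, node_link_gbps, uplink_count, uplink_gbps):
    """Ratio of total node-facing bandwidth to total uplink bandwidth.

    A ratio of 1.0 or less means the uplinks can carry every node at line
    rate; larger values mean nodes contend for the uplinks.
    """
    if uplink_count <= 0 or uplink_gbps <= 0:
        raise ValueError("need at least one uplink with positive bandwidth")
    return (node_count * node_link_gbps) / (uplink_count * uplink_gbps)


# Clusters 3 & 4: 38 GigE nodes on one switch, four uplinks in total.
print(oversubscription(38, 1.0, 4, 1.0))   # assuming 1 Gb/s uplinks
print(oversubscription(38, 1.0, 4, 10.0))  # assuming 10 Gb/s uplinks
```

If the uplinks are GigE, file system traffic from a busy cluster would be heavily oversubscribed on the way to the file servers, which is one argument for the single large switch or for faster uplinks.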
Our final problem is a relatively simple one though I am definitely a newbie to the H.A. world. Under this consolidation plan, we will have only one point of entry to this cluster and hence a single point of failure. Have any beowulfers had experience with deploying clusters with redundant head nodes in a pseudo-H.A. fashion (heartbeat monitoring, fail-over, etc.) and what experiences have you had in adapting your resource manager to this task? Would it simply be more feasible to move the resource manager to another machine at this point (and have both headnodes act as submit and administrative clients)? My current plan is unfortunately light on the details of handling SGE in such an environment. It includes purchasing two identical 1U boxes (with good support contracts). They will monitor each other for availability and the goal is to have the spare take over if the master fails. While the spare is not in use, I was planning on dispatching jobs to it.
There are a number of unfilled blanks in this plan currently (and I have a month with which to fine-tune the rest of this) and so if anyone would be kind enough to offer suggestions on how to fill in a few I'd appreciate it.
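The "monitor each other and take over" idea for the two head nodes can be sketched as a missed-heartbeat counter. This Python fragment is purely illustrative: the class name, the 2-second interval, and the limit of 3 missed beats are assumptions, and it covers only the decision logic, not fencing, split-brain protection, or moving SGE state to the spare.

```python
import time


class PeerMonitor:
    """Tracks heartbeats from the other head node and decides when the
    local node should take over."""

    def __init__(self, interval_s=2.0, missed_limit=3, clock=time.monotonic):
        self.interval_s = interval_s      # expected time between heartbeats
        self.missed_limit = missed_limit  # missed beats tolerated before failover
        self.clock = clock                # injectable so the logic can be tested
        self.last_seen = clock()

    def record_heartbeat(self):
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = self.clock()

    def missed_beats(self):
        """Number of whole heartbeat intervals since the last heartbeat."""
        return int((self.clock() - self.last_seen) // self.interval_s)

    def should_take_over(self):
        """True once the peer has been silent for missed_limit intervals."""
        return self.missed_beats() >= self.missed_limit
```

The spare would call record_heartbeat() on every message from the master and poll should_take_over() on a timer. Requiring several missed beats, rather than one, avoids failing over on a brief network hiccup.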
Thanks to all in advance for any help!

Brian Smith


--
--------------------------------------------------------
+ Brian R. Smith                                       +
+ HPC Systems Analyst & Programmer                     +
+ Research Computing, University of South Florida      +
+ 4202 E. Fowler Ave. LIB618                           +
+ Office Phone: 1 (813) 974-1467                       +
+ Mobile Phone: 1 (813) 230-3441                       +
+ Organization URL: http://rc.usf.edu                  +
--------------------------------------------------------


_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
