Brian,

(Threads like this can get confusing: which Brian? :)

Brian D. Ropers-Huilman wrote:
> Brian,
>
> There are usually three or four categories of storage:
>
> 1) /home - small, just enough to keep source files and compile code
> 2) /scratch/local - distributed disks within a cluster for local
> writing (think Gaussian)
> 3) /scratch/global - a high-performance (and higher cost) parallel
> file system accessible by all nodes
> 4) /archive - a very large pool of spinning disks which receives data
> from /scratch/global when a run (or set of consecutive runs) is
> "complete." The idea is to clear off the expensive parallel system for
> other run-time use, but that you still want to hold the data for some
> future need.
We have 1 through 3. Our /home (1) plays the role of your /archive (4), but we make it the user's responsibility to move their data there. I like your idea, though.
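A minimal sketch of that /scratch/global to /archive migration, written in Python. It assumes a hypothetical convention where a user marks a finished run by creating a file named RUN_COMPLETE in the run's directory. Neither that marker nor the function name comes from any existing tool; both are illustrative only.

```python
import shutil
from pathlib import Path

# Hypothetical convention (not from any real tool): a run directory counts as
# "complete" once the user creates a marker file with this name inside it.
COMPLETE_MARKER = "RUN_COMPLETE"


def archive_completed_runs(scratch_root: Path, archive_root: Path) -> list:
    """Move each completed run directory from scratch_root into archive_root.

    Returns the list of destination paths that were created.
    """
    archive_root.mkdir(parents=True, exist_ok=True)
    moved = []
    for run_dir in sorted(scratch_root.iterdir()):
        if not run_dir.is_dir():
            continue
        if not (run_dir / COMPLETE_MARKER).exists():
            continue  # still running or not yet marked complete; leave it
        dest = archive_root / run_dir.name
        if dest.exists():
            continue  # never clobber a previously archived copy
        shutil.move(str(run_dir), str(dest))
        moved.append(dest)
    return moved
```

Run periodically (for example from cron) as archive_completed_runs(Path("/scratch/global"), Path("/archive")), this frees the parallel file system without relying on users to remember to move their data. A real deployment would also need per-user ownership, quotas, and a policy for runs that are never marked complete.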

> I would keep your /home and /scratch/global separate.
I've thought about this, and it makes sense on a couple of levels: a) A lot of the data written to /scratch/global is fairly transient. Users keep some results and discard many others. If /home == /scratch/global, chances are our backup tapes will be littered with data that nobody wants. b) It avoids a single point of failure.

However, I think there are some advantages if you can merge the two. You only have one disk array to administer, and all of your fault-tolerance, monitoring, and maintenance efforts can be focused on that device. When you're a one-man cluster army, sysadmining and maintaining the systems while also testing, developing, and deploying codes, you learn to appreciate that kind of consolidation. Sure, it may appear to be a single point of failure, but the plan also includes an offsite backup volume which can be VLANed into the cluster's network. If the local array dies, the offsite array can take its place (albeit with significantly reduced performance) until the main array is repaired. The offsite array should also be movable (fairly quickly) to our datacenter as a drop-in replacement.

> The /scratch/global solution you pick will very much depend on how you
> want it connected to your clusters. By definition (of your cluster
> suite), you cannot have a system that relies on IB, as not all of your
> systems have IB. This leaves GbE as the only global means of
> connection. If at all possible, I would dedicate a GbE interface on
> all nodes that access /scratch/global.

Yes, this is unfortunate. Fortunately, though, very few problems running on the current system need the level of disk performance that an IB-connected storage device provides. It would be good to have later, but we can do without it for now. I agree about the separate network as well; I've heard that recommendation elsewhere.
Thanks for the advice!

Brian

--
--------------------------------------------------------
+ Brian R. Smith                                       +
+ HPC Systems Analyst & Programmer                     +
+ Research Computing, University of South Florida      +
+ 4202 E. Fowler Ave. LIB618                           +
+ Office Phone: 1 (813) 974-1467                       +
+ Mobile Phone: 1 (813) 230-3441                       +
+ Organization URL: http://rc.usf.edu                  +
--------------------------------------------------------

_______________________________________________
Beowulf mailing list
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
