Brian,
(Threads like this can get confusing; which Brian? :)
Brian D. Ropers-Huilman wrote:
> Brian,
> There are usually three or four categories of storage:
> 1) /home - small, just enough to keep source files and compile code
> 2) /scratch/local - distributed disks within a cluster for local
> writing (think Gaussian)
> 3) /scratch/global - a high-performance (and higher cost) parallel
> file system accessible by all nodes
> 4) /archive - a very large pool of spinning disks which receives data
> from /scratch/global when a run (or set of consecutive runs) is
> "complete." The idea is to clear off the expensive parallel system for
> other run-time use, but that you still want to hold the data for some
> future need.
We have 1-3. Our equivalent of 4 is our 1, but we make it the user's
responsibility to move their data there. I like your idea, though.
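For context, here is roughly how the four tiers would look from one of
our compute nodes. The hostnames, devices, filesystem types, and mount
options below are made up; it's just a sketch of the layout:

  # /etc/fstab on a compute node (hypothetical names and options)
  home-server:/export/home        /home           nfs     rw,hard,intr      0 0
  /dev/sdb1                       /scratch/local  ext3    noatime           0 0
  mds1@tcp0:/scratch              /scratch/global lustre  defaults,_netdev  0 0
  archive-server:/export/archive  /archive        nfs     rw,hard,intr      0 0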
> I would keep your /home and /scratch/global separate.
I've thought about this, and it makes sense on a couple of levels:
a) A lot of the data that gets written to /scratch/global is fairly
transient in nature. Some results a user might keep; many others they
discard. If /home == /scratch/global, then chances are our backup
tapes will be littered with data that nobody wants. b) It avoids a
single point of failure.

However, there are some advantages, I think, if you can merge the two:
a) You only have one storage device to administer, and all of your
efforts for fault tolerance, monitoring, and maintenance can be
focused on that device. When you're a one-man cluster army,
sysadmining and maintaining while also testing, developing, and
deploying codes, you learn to appreciate consolidation of this nature.
Sure, it may appear to be a single point of failure, but the plan also
includes an offsite backup volume which can be VLAN'ed into the
cluster's network. If the local array dies, the offsite array can take
its place (albeit with significantly reduced performance) until
repairs can be made to the main array. The offsite array should also
be able to be moved physically (fairly quickly) to our datacenter as a
drop-in replacement.
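To make that failover idea a little more concrete, the swap I have in
mind is roughly the following, assuming the offsite volume is exported
over NFS once it's VLAN'ed in (hostnames and paths are hypothetical):

  # local array has died; lazy-unmount it and pull in the offsite copy
  umount -l /scratch/global
  mount -t nfs -o rw,hard,intr offsite-array:/export/scratch /scratch/global

Performance over the offsite link will be much worse, but jobs can
keep running until the main array is repaired or the offsite box is
physically moved into the datacenter.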
> The /scratch/global solution you pick will very much depend on how you
> want it connected to your clusters. By definition (of your cluster
> suite) you cannot have a system that relies on IB as not all of your
> systems have IB. This leaves GbE as the only global means of
> connection. If at all possible, I would dedicate a GbE interface on
> all nodes who access /scratch/global.
Yes, this is unfortunate. But fortunately, very few of the problems
running on the current system need disk access at the level an
IB-connected storage device would provide. It would be good to have
later, but we can pass for now. I agree with the separate networks as
well; I've heard the same recommendation elsewhere.
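For what it's worth, the way I picture the dedicated interface is just
a second GbE port on each node sitting on its own storage subnet, with
/scratch/global mounted via the file server's address on that subnet.
Addresses, device names, and the NFS assumption below are all
hypothetical:

  # eth0 = MPI/cluster traffic, eth1 = storage-only network
  ifconfig eth1 10.2.0.101 netmask 255.255.255.0 up
  # mount global scratch through the storage-side address so its
  # traffic never competes with MPI traffic on eth0
  mount -t nfs -o rw,hard,intr 10.2.0.1:/export/scratch /scratch/global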
Thanks for the advice!
Brian
--
--------------------------------------------------------
+ Brian R. Smith +
+ HPC Systems Analyst & Programmer +
+ Research Computing, University of South Florida +
+ 4202 E. Fowler Ave. LIB618 +
+ Office Phone: 1 (813) 974-1467 +
+ Mobile Phone: 1 (813) 230-3441 +
+ Organization URL: http://rc.usf.edu +
--------------------------------------------------------