Hi Erich,
I'd say that the GPFS failure groups are a good example of exactly what
I'm talking about.
From [1]:
---
GPFS failover support allows you to organize your hardware into a number
of failure groups. A failure group is a set of disks that share a common
point of failure that could cause them all to become simultaneously
unavailable. When used in conjunction with the replication feature of
GPFS, the creation of multiple failure groups provides for increased
file availability should a group of disks fail. GPFS maintains each
instance of replicated data and metadata on disks in different failure
groups. Should a set of disks become unavailable, GPFS fails over to the
replicated copies in another failure group.
During configuration, you assign a replication factor to indicate the
total number of copies of data and metadata you wish to store.
Replication allows you to set different levels of protection for each
file or one level for an entire file system. Since replication uses
additional disk space and requires extra write time, you might want to
consider replicating only file systems that are frequently read from but
seldom written to. To reduce the overhead involved with the replication
of data, you may also choose to replicate only metadata as a means of
providing additional file system protection. For further information on
GPFS replication, see File system recoverability parameters.
---
You can see here that this is *not* something they intend for general
use, especially not for write-heavy workloads (like computational
science). Further, this is the mechanism that they suggest avoiding,
relying instead on shared hardware and failover.
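To make the failure-group idea concrete, here is a minimal sketch (plain
Python, purely illustrative; the disk names and round-robin placement are
my own assumptions, not GPFS's actual algorithm) of keeping each copy of
a block in a different failure group:

# Illustrative sketch only -- not GPFS code. The idea: each copy of a
# block goes to a disk in a *different* failure group, so losing one
# group never loses every copy.
from collections import defaultdict

def place_replicas(block_id, disks, replication_factor=2):
    """Pick one disk from a distinct failure group for each copy.

    `disks` is a list of (disk_name, failure_group) pairs; the names
    and the placement policy here are hypothetical.
    """
    by_group = defaultdict(list)
    for name, group in disks:
        by_group[group].append(name)

    groups = sorted(by_group)
    if replication_factor > len(groups):
        raise ValueError("not enough failure groups for the requested copies")

    placement = []
    for i in range(replication_factor):
        # Rotate the starting group by block id so load spreads out.
        group = groups[(block_id + i) % len(groups)]
        disks_in_group = by_group[group]
        placement.append((group, disks_in_group[block_id % len(disks_in_group)]))
    return placement

# Example: four disks in two failure groups, two copies of each block.
disks = [("d1", 1), ("d2", 1), ("d3", 2), ("d4", 2)]
for block in range(4):
    print(block, place_replicas(block, disks))

Each block ends up with one copy in group 1 and one in group 2, so the
loss of either group still leaves a full copy, which is the property the
documentation above is describing.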
*Conceptually* lots of things are possible, and in fact there are a lot
of really interesting ideas that have been pursued in research and
production domains. Panasas has an interesting way of driving redundant
storage from clients, as another production example.
So far these approaches aren't widely used in production HEC
deployments, to my knowledge, because they simply slow things down too
much. They might make good sense in a bioinformatics application, for
instance, where datasets are often read-only.
The Ceph group at UCSC is also looking at options in this area, which is
close to home for you.
[1]
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins1015.html
Erich Weiler wrote:
IBM's GPFS has done this quite nicely with primary and redundant server
disks; they also use a concept called 'failure' groups that provides
backups for nodes with common failure points. It's a sort of replication
technique, not exactly a RAID 5 type of redundancy, but it works. I
understand this kind of thing is not trivial to code, but conceptually
it seems doable.
-erich
Rob Ross wrote:
Hi Steve,
We get this question a lot.
Software redundancy in a parallel file system is a very challenging
problem, particularly if it is to provide efficient access at the same
time. The group at Clemson has been looking into this as a research
project, and I believe that others have as well. If a group creates a
solution that performs well, operates reliably, and fits into the rest
of the PVFS system, then we would certainly consider integrating it into
the production releases. This hasn't happened so far...
Regards,
Rob
Steve wrote:
Is built-in redundancy planned? Or is it not in the scope of the project?
Steve
I'm trusting my 1.1TB to the reliability of my drives, and, touch wood,
in 20 years of computing I've never had a drive fail. Now I've just put
a curse on them!
-------Original Message-------
From: Robert Latham
Date: 24/04/2007 14:14:13
To: Erich Weiler
Cc: [email protected]
Subject: Re: [Pvfs2-users] Question about redundancy

On Mon, Apr 23, 2007 at 05:03:39PM -0700, Erich Weiler wrote:
I need to be clear on this before putting a lot of time into it, but
it sounds like this might be a good solution for our firm, as we have
a 200-node cluster, each node with one 500GB disk, 400GB of which can
be contributed to a massive parallel file system (400GB x 200 nodes =
one big ~80TB distributed file system). But that assumes there is no
redundancy; otherwise that 80TB would be more like 50-60TB max or
something, because there would be some redundancy in there... ?
Murali's explanation is spot-on: there is no software-based redundancy
scheme. For users concerned with redundancy, we suggest hardware
failover to shared storage, which works quite well.

==rob
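For what it's worth, a quick back-of-the-envelope on the capacity
question above (plain Python; the parity-group size of 5 is just an
assumption for illustration, not something PVFS provides):

# Rough capacity arithmetic for the 200-node example, decimal units.
nodes = 200
per_node_gb = 400

raw_tb = nodes * per_node_gb / 1000.0   # 80 TB with no redundancy
mirrored_tb = raw_tb / 2                # 40 TB if every block has 2 full copies
parity_tb = raw_tb * (5 - 1) / 5        # 64 TB with parity over groups of 5 disks

print(raw_tb, mirrored_tb, parity_tb)

So the "50-60TB" guess is the right ballpark for a parity-style scheme,
while full mirroring would cost closer to half the raw space.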
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users