Hi Erich,

I'd say that the GPFS failure groups are a good example of exactly what I'm talking about.

From [1]:

---

GPFS failover support allows you to organize your hardware into a number of failure groups. A failure group is a set of disks that share a common point of failure that could cause them all to become simultaneously unavailable. When used in conjunction with the replication feature of GPFS, the creation of multiple failure groups provides for increased file availability should a group of disks fail. GPFS maintains each instance of replicated data and metadata on disks in different failure groups. Should a set of disks become unavailable, GPFS fails over to the replicated copies in another failure group.

During configuration, you assign a replication factor to indicate the total number of copies of data and metadata you wish to store. Replication allows you to set different levels of protection for each file or one level for an entire file system. Since replication uses additional disk space and requires extra write time, you might want to consider replicating only file systems that are frequently read from but seldom written to. To reduce the overhead involved with the replication of data, you may also choose to replicate only metadata as a means of providing additional file system protection. For further information on GPFS replication, see File system recoverability parameters.

---

You can see here that this is *not* something they intend for general use, especially not for write-heavy workloads (like computational science). Further, this is the mechanism they suggest avoiding in favor of shared hardware and failover.
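
To make the failure-group idea concrete, here is a minimal placement sketch in Python. It's my own illustration of the concept, not GPFS code; the disk names, group tags, and function are invented:

# Hypothetical sketch of failure-group-aware replica placement (not GPFS code).
# Each disk is tagged with the failure group it belongs to, e.g. all disks
# behind one controller or attached to one node share a group.
DISKS = {
    "disk0": "fg1", "disk1": "fg1",   # failure group 1
    "disk2": "fg2", "disk3": "fg2",   # failure group 2
    "disk4": "fg3", "disk5": "fg3",   # failure group 3
}

def place_replicas(block_id, replication_factor):
    """Pick one disk from each of `replication_factor` distinct failure
    groups, so that losing any single group still leaves a copy."""
    groups = {}
    for disk, fg in DISKS.items():
        groups.setdefault(fg, []).append(disk)
    if replication_factor > len(groups):
        raise ValueError("not enough failure groups for that many replicas")
    # Rotate the starting group per block so load spreads across groups.
    ordered = sorted(groups)
    chosen = [ordered[(block_id + i) % len(ordered)] for i in range(replication_factor)]
    return [groups[fg][block_id % len(groups[fg])] for fg in chosen]

if __name__ == "__main__":
    for blk in range(4):
        print(blk, place_replicas(blk, replication_factor=2))

The write penalty the documentation warns about is visible even in this toy version: every write has to land on replication_factor disks instead of one, which is exactly why they steer write-heavy workloads toward shared hardware and failover instead.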

*Conceptually* lots of things are possible, and in fact there are a lot of really interesting ideas that have been pursued in research and production domains. As another production example, Panasas has an interesting way of driving redundant storage from the clients.

So far these approaches aren't widely used in production HEC deployments, to my knowledge, because they simply slow things down too much. They might make good sense in a bioinformatics application, etc., where datasets are often read-only.

The Ceph group at UCSC is another group that is looking at options in this area, close to home for you.

[1] http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins1015.html

Erich Weiler wrote:
IBM's GPFS has done this quite nicely with primary and redundant server disks; they also use a concept called 'failure groups' that provides backups for nodes with common failure points. It's a sort of replication technique, not exactly RAID 5-style redundancy, but it works. I understand this kind of thing is not trivial to code, but conceptually it seems doable.

-erich

Rob Ross wrote:
Hi Steve,

We get this question a lot.

Software redundancy in a parallel file system is a very challenging problem, particularly if you want to provide efficient access at the same time.

The group at Clemson has been looking into this as a research project, and I believe that others have as well. If a group creates a solution that performs well, operates reliably, and fits into the rest of the PVFS system, then we would certainly consider integrating it into the production releases. This hasn't happened so far...

Regards,

Rob

Steve wrote:
Is built-in redundancy planned? Or is it not in the scope of the project?

Steve

I'm trusting my 1.1TB to the reliability of my drives and, touch wood, in 20 years of computing I've never had a drive fail. Now I've just put a curse on them!
-------Original Message-------
From: Robert Latham
Date: 24/04/2007 14:14:13
To: Erich Weiler
Cc: [email protected]
Subject: Re: [Pvfs2-users] Question about redundancy

On Mon, Apr 23, 2007 at 05:03:39PM -0700, Erich Weiler wrote:
I need to be clear on this before putting a lot of time into it, but it sounds like this might be a good solution for our firm, as we have a 200-node cluster, each node with one 500GB disk, 400GB of which can be leveraged for a massive parallel file system (400GB x 200 nodes = one big ~80TB distributed file system). But that assumes there is no redundancy; otherwise that 80TB would be more like 50-60TB max or something, because there would be some redundancy in there...?
Murali's explanation is spot-on: there is no software-based redundancy scheme. For users concerned with redundancy, we suggest hardware failover to shared storage, which works quite well. ==rob
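
As a quick back-of-the-envelope check on the capacity question above: the overhead depends entirely on the redundancy scheme. The layouts below are generic illustrations (nothing PVFS actually implements), just to bound the 50-60TB guess:

# Rough usable-capacity estimates for 200 nodes x 400GB = 80TB raw.
# The schemes below are generic illustrations, not anything PVFS ships.
RAW_TB = 200 * 0.4  # 80 TB of raw space

schemes = {
    "no redundancy":          1.0,       # all 80 TB usable
    "2-way replication":      1.0 / 2,   # every block stored twice
    "3-way replication":      1.0 / 3,
    "8+2 parity (RAID6-ish)": 8.0 / 10,
}

for name, efficiency in schemes.items():
    print(f"{name:24s} ~{RAW_TB * efficiency:5.1f} TB usable")

So a 50-60TB figure would sit somewhere between plain mirroring (40TB usable) and a wide parity layout (64TB usable).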

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
