On May 14, 2008 14:21 -0400, jrs wrote:
> I work for a small/medium company that does image processing.
> We have about 700TB of data presently and might be at 2PB within
> the next couple of years. Owing to the amount of data we don't
> make backups for most of it and trust raid 6 on our hardware raid
> boxes (nexsan Satabeast) to fail more slowly than we can replace
> disks. Over the last couple of years we've had great luck and,
> I believe, have never lost data owing to a failure with this
> hardware (software or human error is another matter ;-).
> However, the unbacked up data is "mission critical." Though
> it can, probably, all be reconstructed or reacquired, as a practical
> matter losing a significant quantity of this data could be
> catastrophic for our business.
>
> So, what do you think, can lustre be trusted to keep our
> data safe at our company? Assume in answering that we have
> failover working properly. We can also withstand some blocking
> of the filesystem while a failover event completes, i.e., not
> having the filesystem available for some amount of time is
> not a problem, but having directory important-data/ disappear
> is a HUGE problem.

You are confusing two separate ideas: availability and backup. Having RAID1/5/6 and failover allows data to remain accessible in the face of hardware failures without (much) interruption. Having a second copy of your data allows it to be accessible (usually after a longer delay) in a much wider range of scenarios, like multiple hardware failures, software errors, human errors, site catastrophe, etc.

There have been a few customer incidents recently where a user (whether malicious or uninformed) or a malformed script was deleting filesystem data at a very high rate, and by the time someone noticed the problem hundreds of TB of data had been deleted in each case. That is nothing that RAID6 or failover will save you from.

Similarly, even with RAID6 it is possible to have multiple-drive failures after events like power outages, because usually all of the drives in a RAID set are from the same manufacturing batch and are more likely to fail at the same time. Very large sites that have annual power maintenance outages see enough of these kinds of failures that they advise users to back up their important files before the outage.

So, I think the important point I'm making is that no matter how reliable Lustre (or any storage) is, not having any proper backup is asking for trouble in the long run.

In my opinion, if you have a large shared filesystem, a user-driven backup system is the best model. Users are the ones best informed about which data is the most important to keep, and if the onus of backup is communicated to them clearly they only have themselves to blame.

If you use Lustre as a single data repository for some application, and all of the files are equally important, then my only suggestion is to go to some configuration with a full second copy of the data that is updated on a regular (though not continuous) basis. If it is updated continuously then any "rm -r" kind of error will also propagate to the backup too quickly.
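
To make the "regular, though not continuous" point concrete, here is a minimal Python sketch of a periodic sync pass that refuses to propagate deletions when too many files have vanished from the primary since the last run. It is only an illustration under stated assumptions: the names plan_sync and max_delete_fraction, and the 5% threshold, are hypothetical and not part of Lustre or any existing tool. It also compares file names only, so changed files are not detected.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def plan_sync(primary_files, backup_files, max_delete_fraction=0.05):
    """Plan one periodic sync pass from the primary to the backup copy.

    Hypothetical helper: both arguments are sets of relative paths,
    and max_delete_fraction is an assumed safety threshold.
    Returns (to_copy, to_delete, deletions_allowed).
    """
    # Files that exist on the primary but not yet on the backup.
    to_copy = primary_files - backup_files
    # Files that exist on the backup but have vanished from the primary.
    to_delete = backup_files - primary_files
    if backup_files:
        delete_fraction = len(to_delete) / len(backup_files)
    else:
        delete_fraction = 0.0
    # A large wave of deletions looks like an "rm -r" accident, so hold
    # the deletions back until an administrator has reviewed them.
    deletions_allowed = delete_fraction <= max_delete_fraction
    return to_copy, to_delete, deletions_allowed
```

A scheduled job could call plan_sync(list_files(primary_root), list_files(backup_root)), copy the new files, and only remove files from the backup when deletions_allowed is true. If you mirror with rsync instead, its --max-delete option puts a similar cap on how many deletions a single run may perform.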

The backup system can be MUCH less performant than the primary copy, and you can do things like oversubscribe the OSTs to single OSS nodes, and have less RAM on the servers. Considering that a low-performance 700TB filesystem can probably be built for a cost of around $200k, you have to weigh the costs of this against the potential business cost of losing some or all of your data.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
