>> I wrote a blog post that pertains to Lustre scalability and
>> data integrity. You can find it here:
>> http://braamstorage.blogspot.com

Ah amusing, but a bit late to the party. The DBMS community have been dealing with these issues for a very long time; consider the canonical definitions of "database" and "very large database":

* "database": a mass of data whose working set cannot be held in memory; a mass of data where every access involves at least one physical IO.
* "very large database": a mass of data that cannot be realistically taken offline for maintenance; a mass of data that takes "too long" to back up or check.

But I am very pleased that the "fsck wall" is getting wider exposure; I have been pointing it out in my little corner for years.

> [ ... ] like Veritas already solved this by
> 1. Integrating the Volume management and File system. The file
> system can be spread across many volumes.

That's both crazy and nearly pointless. It is at best a dubious convenience.

> 2. Dividing the file system into a group of file sets(like
> data, metadata, checkpoints) , and allowing the policies to
> keep different filesets on different volumes.

That's also crazy and nearly pointless, as described.

> 3. Creating the checkpoints (they are sort of like volume
> snapshots, but they are created inside the file system
> itself). [ ... ]

These are an ancient feature of many fs designs, and for various reasons versioned filesystems have never been that popular. In part because of performance, in part because it is not that useful, in part because it is the wrong abstraction level.

> 4. Parallel fsck - if the filesystem consists of the
> allocation units - a sort of the sub- file systems, or
> cylinder groups, then the fsck can be started in parallel
> on those units.

This is either pointless or not that useful. It can be done fairly trivially by using many filesystems and creating a single namespace by "mounting" them together; of course then one does not have a single free storage pool, even if the namespace is stitched together.
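
To make that trade-off concrete, here is a toy sketch in Python. Everything in it is hypothetical (`SmallFS`, `check_all`, the mount paths and block counts are made up for illustration, and `check` only stands in for a real 'fsck' pass); it just shows that independent filesystems can be checked in parallel, while their free space does not add up to one usable pool.

```python
from concurrent.futures import ThreadPoolExecutor


class SmallFS:
    """Toy stand-in for one independent filesystem (hypothetical, not a real API)."""

    def __init__(self, name, capacity_blocks):
        self.name = name
        self.capacity = capacity_blocks
        self.files = {}  # path -> number of blocks used

    def free(self):
        return self.capacity - sum(self.files.values())

    def write(self, path, blocks):
        # Space can only come from this filesystem's own pool.
        if blocks > self.free():
            raise OSError(f"{self.name}: no space left")
        self.files[path] = blocks

    def check(self):
        # Stand-in for 'fsck': it touches only this filesystem's own state,
        # which is why the checks can run independently of each other.
        used = sum(self.files.values())
        return self.name, 0 <= used <= self.capacity


def check_all(mounts):
    """Run one check per filesystem in parallel and collect the results."""
    with ThreadPoolExecutor(max_workers=len(mounts)) as pool:
        return dict(pool.map(lambda fs: fs.check(), mounts.values()))


# One namespace stitched together from two filesystems of 100 blocks each.
mounts = {"/data/a": SmallFS("a", 100), "/data/b": SmallFS("b", 100)}
mounts["/data/a"].write("/data/a/big", 90)

results = check_all(mounts)
total_free = sum(fs.free() for fs in mounts.values())

# 20 blocks are free in aggregate, but not under /data/a.
try:
    mounts["/data/a"].write("/data/a/more", 20)
    fits = True
except OSError:
    fits = False
```

The parallel check is the easy part; the failed write is the price. Spreading a file's blocks across both pools would fix the second problem, at the cost of making each check depend on the other filesystem's state.
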
But it is exceptionally difficult to have a single storage pool *and* chunking (as soon as object contents are spread across multiple chunks 'fsck' becomes hard, and if object contents are not spread across multiple chunks, you don't really have a single storage pool).

The fundamental problem with 'fsck' is that:

* Data access scales up by using RAID, as N disks, with suitable access patterns, give a speedup of up to N (either in bandwidth or IOPS), so it is feasible to create very large storage systems by driving parallelism up at the data level.
* Unfortunately while data performance *can* scale with the number of disks, metadata access cannot, because it is driven by wholly different access patterns, usually more graph-like than stream-like.

In essence 'fsck' is a garbage collector, and thus it is both unavoidable and exceptionally hard to parallelize.

Note also that the "IOPS wall" (similar to the "memory wall"), where storage device capacity and bandwidth grow faster than IOPS, eventually calls into question even data scalability, and in some applications (like the Lustre MDS) that is already quite apparent.

> Well, the ZFS does solve many of these issues, but in a
> different way, too.

ZFS is not the solution to almost any problem, except perhaps sysadmin convenience. The UNIX lesson is that the main job of a file system is to provide a simple, trivial "dataspace" abstraction layer, and that trying to have it address storage layer (for example checksumming) or application layer (for example indices) concerns is poor design.

It does seem quite convenient though (to the sort of people who want to do triple parity RAID and 46+2 RAID6 arrays, or build large filesystems as LVM2 concats [VGs] spanning several disks).

> So, my point is that this probably has to be solved on the
> backend side of the Lustre, rather than inside the Lustre.

Lustre embodies a very specific set of tradeoffs aimed at a specific "sweet spot" as described by PeterB in his blogpost.
Violating design integrity is usually very painful. A wholly new design is probably needed.

As to scalability, there is a proof of existence for extremely scalable file system designs, and that is GoogleFS, and it embodies pretty extreme tradeoffs (far more extreme than Lustre) in pursuit of scalability.

If GoogleFS is the state of the art, then I suspect that very scalable, fine grained, and highly efficient are incompatible goals (and very, very rarely a requirement either).

BTW I am occasionally reminded of two ancient MIT TRs, one by Peter Bishop about distributed persistent garbage collection, and one by Svobodova on object histories in the Swallow repository.
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
