[ ... ]
adilger> Putting 4 OSTs on a single disk doesn't make sense.
adilger> A single OST can be up to 8TB, and if you have multiple
adilger> OSTs on the same disk(s) it will cause terrible
adilger> performance problems due to seeking.
Uhm, not exactly; that's a quick but simplistic answer: things are more complicated than that. The seeking depends mostly on access patterns and the number of disks. Suppose that you have a 1TB disk and divide it into one or two filesystems: for a given file set (an assumption relaxed later) and access pattern, the same bits of the disk will be accessed. The two filesystems end up acting mostly as super-cylinder-groups, that is, mostly disjoint free-space allocation pools.

There are secondary effects from the disjoint free-space allocation (one filesystem means allocations can spread all over the disk; two filesystems restrict allocation to two separate pools, which will most likely improve clustering). Also, two separate filesystems are more resilient to serious mangling, and might fsck faster (because of the better clustering) if checked sequentially.

But the assumption of "a given file set" does not hold if the two filesystems are part of the same Lustre filesystem *and* striping is happening. In that case two objects that are parts of the same Lustre file will usually end up on the two partitions, and Lustre will assume that they can be fetched in parallel when they really cannot be, and this may reduce performance. But the overall effect will not be big; it will mostly be the same as if the max object size had been doubled, because again performance depends mostly on file access patterns and the number of drives.

For small files, though, it halves the number of disks across which a file can stripe, but this can be countered by halving the max object size. Consider this example: a max object size of 1MiB, a 100MiB file, 10 drives, and striping. With one filesystem per drive you can read 10MiB in parallel in 1MiB objects (stripe size 10MiB). With two filesystems per drive you can read 20MiB "in parallel" (stripe size 20MiB) in 2x1MiB objects that are serialized by the drive.
If the max object size is changed to 512KiB in the two-filesystems-per-drive case, you can still read 10MiB in parallel in 2x512KiB objects (back to the 10MiB stripe size). Now one might argue that in the 10x1MiB case each 1MiB object is likely to be more contiguous than in the 10x2x512KiB case, where the two 512KiB objects are forced to be in different halves of the disk; but then let me point out that the 100MiB file striped across the 10 drives in 1MiB objects has got 10x1MiB objects per drive anyhow, and whether they are clustered or not is mostly up to luck. So the issue really is whether 20x512KiB objects per drive are going to be less clustered than 10x1MiB objects, and my guess is that it does not matter a lot, and in some cases it might even be of benefit.

Anyhow, there is a case where two OSTs per drive is most likely of benefit: the case where the two OSTs belong to two Lustre filesystems, one faster (outer-track OSTs) and used more often, and one slower (inner-track OSTs) and used less often. That amounts to a crude form of hand-clustering. Still, performance likely depends more on the overall file access patterns and the number of disks than on whether they are split across two distinct allocation pools.

Note 1: a fair bit also depends on the in-cylinder-group allocation policy of 'ldiskfs' and on how often the allocator will switch to a different cylinder group.

Note 2: maybe there is some special issue within Lustre that makes it rather less effective with two partitions per disk.

Note 3: in many if not most (just a guess) Lustre installations the "disk" is actually a SAN RAID pool, each OST is a LUN of that pool, and that LUN is in effect a slice of a partition off each disk. Now this may not be at all what Lustre should be about :-).
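The arithmetic in the 10-drive example above can be sketched in a few lines of Python; this is just the back-of-the-envelope stripe math from this thread, not anything Lustre actually computes (the function name and parameters are mine):

```python
# Back-of-the-envelope stripe math, assuming one object per OST per full
# stripe: with N drives, F filesystems (OSTs) per drive, and a max object
# size S, one full stripe spans N*F objects of size S, but each drive has
# to serve its F objects serially.
def stripe_math(drives, osts_per_drive, object_size_kib):
    stripe_width_kib = drives * osts_per_drive * object_size_kib  # read "in parallel"
    per_drive_kib = osts_per_drive * object_size_kib  # serialized on each drive
    return stripe_width_kib, per_drive_kib

# 10 drives, one OST each, 1MiB objects: 10MiB stripe, 1MiB per drive.
assert stripe_math(10, 1, 1024) == (10240, 1024)
# Two OSTs per drive, same object size: 20MiB stripe, but 2MiB serialized per drive.
assert stripe_math(10, 2, 1024) == (20480, 2048)
# Halving the max object size to 512KiB restores the original 10MiB stripe.
assert stripe_math(10, 2, 512) == (10240, 1024)
```

Which is the point: doubling the OST count per drive while halving the max object size leaves both the stripe width and the per-drive serialized load unchanged.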
Amazing, barely related discovery BTW: while searching for info on the current cylinder-group policies of file system designs in the 'ext' family, I found that there was an interesting filesystem called "ext4" in 1997, which has some elements reminiscent of Lustre (or of the original UNIX filesystem design): http://www.cs.cmu.edu/~mihaib/fs/fs.html "A Dual-Disk File System: ext4, Mihai Budiu, April 16, 1997". So RedHat and Linus should change the name of the recently introduced one to 'ext5'.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
