Peter, you have provided an excellent explanation. One thing that was clearly missing from my answer is the number of Lustre clients required to get 2 GB/Sec of sustained unidirectional throughput.
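
In round numbers (the per-client rates here are rough assumptions, the same ones I use below, not measurements):

  $ echo $(( 2000 / 50 ))     # 2 GB/Sec target / ~50 MB/Sec per GbE client
  40
  $ echo $(( 2000 / 500 ))    # 2 GB/Sec target / ~500 MB/Sec per 10GbE client
  4
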
If the Lustre clients are connected via standard gigabit ethernet, each link tops out at about 110 MB/Sec (almost impossible to achieve in real life). If we consider a modest 50 MB/Sec/client over GbE, you will need about 40 clients writing a single 4-way striped file with a large blocksize to reach 2 GB/Sec. With 10 gigabit ethernet, you will need at least 4-5 clients to get 2 GB/Sec aggregate throughput while writing a single 4-way striped file with a large blocksize. And, as Peter said, this performance is for a fresh Lustre filesystem with only the outer tracks of the disks in use.

On a different note, we have hacked the sgpdd-survey tool that comes with Lustre-iokit to benchmark the whole disk. You can get details in Bugzilla Bug 17218. In my experience, IO performance drops by more than 60 % once the disks start using the inner tracks.

Cheers,
_Atul

Peter Grandi wrote:
> [ ... ]
>
>>> I am considering a new storage of 30 TB usable space with a 2
>>> GB/s sustained read write performance in clustered mode.
>
> This spec is comically vague (see below) and anyhow it is going to be
> quite challenging. Some people who usually know what they are doing
> (CERN) currently expect around 20MB/s duplex transfer rate per TB,
> and I know of one storage system that is getting 3-4GB/s duplex (on a
> fresh install) with around 240 1TB drives (including overheads). If
> you want to guarantee 2GB/s sustained duplex perhaps aiming for
> 3-4GB/s is a good idea. See at the end for a similar conclusion.
>
> So getting to 2GB/s sustained and duplex is going to require quite
> careful consideration of the particular circumstances if one wants it
> done with just 30TB worth of drives.
>
> As to vague, one Lustre storage I know of was initially specified
> with the same level of (un)detail, and with a bit of prodding it got
> a more definite target performance envelope. That is vital.
>
>>> But not able to figure out sizing part of it like what OSS, what
>>> OST and what MDS.
>
> Partitioning space between the various types of Lustre data areas is
> the least of your problems. The bigger issue is the structure of the
> storage system on which Lustre runs.
>
>>> Urgent help would be highly appreciable
>
> People usually pay for urgent help, especially for difficult cases.
> You should hire a good consultant (e.g. from Sun) who will ask you a
> lot of questions.
>
>> Hi Deval, Lustre storage sizing is largely driven by:
>> * Capacity required
>> * Performance required
>> * Type of workload
>
> Just about only on capacity required. Performance required given the
> type of workload (both static, distribution of file sizes, and
> dynamic, patterns of access) drives storage structure more than
> storage sizing, and indeed later you talk about structure without
> considering the workload.
>
> Storage and Lustre filesystem structure (not mere sizing) depends
> greatly on things like size of files, sequentiality of access, size
> of IO operations, number of files being concurrently worked on,
> number of processes concurrently working on the same file, etc.; a
> list of several of these is here:
>
>   http://www.sabi.co.uk/blog/0804apr.html#080415
>
> A pretty vital detail here is how many clients will be in that target
> of 2GB/s duplex, and distinctly for reading and writing. It could be
> 20 1Gb/s clients writing at 100MB/s, and 1 analysis client reading at
> the same time at 2GB/s over multiple 10Gb/s links, for example.
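>
> Purely as an illustration of what "a single 4-way striped file" means
> on the client side (the mountpoint, filename and sizes here are
> made-up examples, not a recommendation):
>
>   client$ lfs setstripe -c 4 /mnt/lustre/bigfile    # spread the file over 4 OSTs
>   client$ dd bs=1M count=10000 if=/dev/zero of=/mnt/lustre/bigfile
>
> With a stripe count of 4 the writes get spread across 4 OSTs, so the
> per-file bandwidth can exceed what a single OST can deliver.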
>
> Another interesting dimension is whether a single storage pool is
> necessary or not, or just a single namespace (to some extent Lustre
> is in between) with multiple pools and suitable use of mountpoints:
>
>   http://www.sabi.co.uk/blog/0906Jun.html#090614
>
> It is also important to know the availability requirements for the
> storage system. Does the "sustained" in "2 GB/s sustained" mean for a
> stretch of time or 24x7?
>
> Someone who asks for "Urgent help" should be nice enough to provide
> all these interesting aspects of the requirements to the storage
> consultant they are going to hire.
>
>> Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Let's say
>> you are using SATA disks for the OSTs. A Seagate enterprise 1TB SATA
>> disk can do around 90 MB/Sec with a 1 MB blocksize using dd (it can
>> go up to 110 MB/Sec if the blocksize is really large).
>
> Unfortunately only on the outer tracks and on a fresh filesystem. See
> for example:
>
>   https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007
>
>   «Performance degradation on xc2
>    After 6 months of production we lost half of the file system
>    performance
>    Problem is under investigation by HP
>    We had a similar problem on xc1 which was due to fragmentation
>    Current solution for defragmentation is to recreate file systems»
>
> Note that it is not just "due to fragmentation": even without it, as a
> filesystem fills, blocks will (usually) start being allocated from the
> inner tracks and thus the raw transfer rate will eventually nearly halve:
>
>   base# disktype /dev/sdd | grep 'device, size'
>   Block device, size 931.5 GiB (1000203804160 bytes)
>   base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=0
>   1000+0 records in
>   1000+0 records out
>   1048576000 bytes (1.0 GB) copied, 9.53604 seconds, 110 MB/s
>   base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=950000
>   1000+0 records in
>   1000+0 records out
>   1048576000 bytes (1.0 GB) copied, 18.3843 seconds, 57.0 MB/s
>
> Amazingly, in the "outer tracks and on a fresh filesystem" case a
> modern 1TB SATA disk with a reasonable filesystem type can do 110MB/s
> even with smallish block sizes:
>
>   base# fdisk -l /dev/sdd | grep sdd3
>   /dev/sdd3   *        2      1769    14201460   17  Hidden HPFS/NTFS
>   base# mkfs.ext4 -q /dev/sdd3
>   base# mount -t ext4 /dev/sdd3 /mnt
>   base# dd bs=64k count=100000 conv=fsync if=/dev/zero of=/mnt/TEST
>   100000+0 records in
>   100000+0 records out
>   6553600000 bytes (6.6 GB) copied, 59.7662 seconds, 110 MB/s
>
> (BTW I have used 'ext4' because this is about Lustre; I usually
> prefer JFS for various reasons.)
>
> I have been quite impressed that one can get 90MB/s "outer tracks and
> on a fresh filesystem" from a contemporary low-power laptop drive:
>
>   http://www.sabi.co.uk/blog/0906Jun.html#090605
>
>> Assuming that you are looking for RAID6 protection for the OSTs,
>> you need 10 SATA disks to form an 8 TB LUN.
>
> Why would you assume that? (See below on formulaic approaches.) Why
> use parity RAID, which is known to cause performance problems on
> writes unless they are all aligned, when the only detail provided is
> a target for writing? Perhaps it is a DAQ or other recording
> application given the 2GB/s goal, but perhaps it does not do large
> writes.
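>
> If parity RAID is used anyway, at least tell the filesystem the RAID
> geometry so full-stripe writes are possible. A rough sketch for an
> (8+2) RAID6 array, assuming a 256KiB chunk size (the device name and
> chunk size here are made up; adjust to the real array):
>
>   # stride = 256KiB chunk / 4KiB block = 64; stripe-width = 64 * 8 data disks = 512
>   base# mkfs.ext4 -E stride=64,stripe-width=512 /dev/md0
>
> Misaligned small writes still pay the read-modify-write penalty, but
> aligned full-stripe writes avoid it.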
>
>> You will need 4 such OSTs to give you 32 TB of unformatted space.
>> Let's consider performance:
>>
>> Ideally, you should get 720 MB/Sec/OST [90 MB/Sec/disk X 8 data
>> disks in an (8+2) RAID6 set]. But you have to cater for the overhead
>> of software/hardware RAID and the limits of the SAS PCIe HCA (or FC
>> hardware RAID HCA). A 4Gbps FC HCA tops out at 500 MB/Sec, so you
>> need 5-6 FC HCAs to utilize the storage bandwidth of 4 RAID6 OSTs
>> [total bandwidth = 4 X 720 MB/Sec/OST = 2.8 GB/Sec].
>
> That "4 X 720MB/Sec" applies only if there are at least 4 stripes per
> file and they get written in bulk, as you imply immediately below, or
> there are at least 4 files being written at the same time and they
> end up on different OSTs. The very vague requirement for "clustered
> mode" does not quite make clear which one.
>
>> So, now you have a storage system that delivers 32 TB of unformatted
>> space and 2.8 GB/Sec of performance for a large sequential
>> read/write workload.
>
> The read and write performance may be quite different on many
> workloads because of the RAID6 (stripe alignment), even if probably
> "large sequential" is going to be fine, and again only if the
> concurrency is just right and on outer tracks on a fresh filesystem.
>
>> If you are planning to have a mixed or small-IO workload and still
>> want to achieve 2 GB/Sec throughput, you have to double the specs.
>
> Why just double? Why not consider other storage systems like RAID10
> or SSDs?
>
>> Small, random IO (think of home directories) kills storage
>> performance.
>
> Depends on the storage system...
>
>> Let's size the MDS now.
>>
>> It is a good idea to use FC or SAS disks for the MDS as they spin at
>> a higher rate and have better IOPS performance. For example, let's
>> consider Seagate enterprise 15K rpm 300 GB SAS disks. You can put
>> 4 such SAS disks in a RAID10 configuration for the MDT, which will
>> give you 600 GB of unformatted space. [ ... ]
>
> The MDS is another story indeed.
>
> But I seem to detect here a formulaic approach: the Lustre "don't
> need to think" approach seems to be SAS RAID10 for metadata and SATA
> RAID6 for data, and this is what is being discussed here, straight
> out of the 3-ring binder, without asking any further questions
> despite the extreme vagueness of the target. Which is mostly better
> BTW than what a site I know got from EMC, as their "don't need to
> think" formula at the time seemed to be RAID3 of all things.
>
> Fine (perhaps), but I have a different formulaic approach. Without
> knowing all the details, and even if in some cases parity RAID does
> make sense:
>
>   http://www.sabi.co.uk/blog/0709sep.html#070923b
>
> my "generic" formula (and apparently shared by several academic sites
> that use Lustre for HPC storage) is to get Sun X4500 "Thumper"s (or
> their more recent equivalents) and RAID10 a bunch of disks inside
> them (and then use JFS/XFS or Lustre on top, possibly with DRBD
> between). With Lustre it is easy then to aggregate them by spreading
> the OSTs across multiple "Thumper"s.
>
> In this case the goal is roughly 70MB/s duplex sustained per TB of
> storage, which is rather high, so I would use either SSDs or lots of
> small fast SAS drives for data (or lots of large SATA ones with the
> data partition only in the outer 1/3 of the disk, which some people
> call "short stroking").
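>
> Just to make "short stroking" concrete (the device name is a made-up
> example, and the exact fraction is a matter of taste):
>
>   base# parted -s /dev/sdX mklabel gpt
>   base# parted -s /dev/sdX mkpart data 1MiB 33%    # use only the fast outer third
>
> The capacity loss is large, but the data partition then stays on the
> outer tracks, so the transfer rate never degrades to the inner-track
> figures shown above.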
>
> A lot depends on how big the writes are and how big the files are and
> the degree of concurrency, and the availability target, and all the
> other important aspects of the requirements, most importantly the
> read and write access patterns, as writing and then reading implies
> quite a bit of head movement.
>
> If we assume the 20MB/s duplex rule per TB that CERN uses for bulk
> storage, that translates to 100x SATA 1TB drives, or around 200x 1TB
> with RAID10 (spread around 5-6 "Thumper"s). Or perhaps smaller but
> higher-IOP/s 15k SAS drives. Perhaps large SSDs would be a nice idea.
>
> But the details matter a great deal. Your mileage may vary.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
