Dear Adilger,

It is a single server with 24 NVMe devices and a 100G OPA card.

On Fri, Aug 31, 2018 at 11:20 AM Andreas Dilger <[email protected]> wrote:
> Just to confirm, there is only a single NVMe device in each server node,
> or there is a single server with 24 NVMe devices in it?
>
> Depending on what you want to use the NVMe storage for (e.g. very fast
> short-term scratch == burst buffer) it may be OK to just make a Lustre
> filesystem with each NVMe device a separate OST with no redundancy. The
> failure rate for these devices is low, and adding redundancy will hurt
> performance.
>
> Cheers, Andreas
>
> On Aug 29, 2018, at 00:14, Zeeshan Ali Shah <[email protected]> wrote:
> >
> > Thanks a lot Patrick for detail answer, I tried with gnu parallel with
> > dd and over all the throughput was increased locally .. you are right
> > it is due to client side single thread issue.
> >
> > what about 2nd challenge to export bunch of NVMes from single server as
> > shared volume ? I tried glusterfs (very slow due to dht) , Lustre by
> > creating another filesystem in our existing MDT could be an option . I
> > furthermore tried to export single NVME-over Fabric (NVMEOF) looks
> > promising but i am looking to have a shared volume kind ...
> >
> > any advice ?
> >
> > On Tue, Aug 28, 2018 at 6:37 PM Patrick Farrell <[email protected]> wrote:
> >> Hmm – It’s possible you’ve got an issue, but I think more likely is
> >> that your chosen benchmarks aren’t capable of showing the higher speed.
> >>
> >> I’m not really sure about your fio test - writing 4K random blocks
> >> will be relatively slow and might not speed up with more disks, but I
> >> can’t speak to it in detail for fio. I would try a much larger size
> >> and possibly more processes (is numjobs the number of concurrent
> >> processes?)…
> >>
> >> But I am sure about your other two:
> >> Both of those tests (dd and cp) are single threaded, and if they’re
> >> running to Lustre (rather than to the ZFS volume directly), 1.3 GB/s
> >> is around the maximum expected speed. On a recent Xeon, one process
> >> can write a maximum of about 1-1.5 GB/s to Lustre, depending on
> >> various details. Improving disk speed won’t affect that limit for one
> >> process, it’s a client side thing. Try several processes at once,
> >> ideally from multiple clients (and definitely writing to multiple
> >> files), if you really want to see your OST bandwidth limit.
> >>
> >> Also, a block size of 10GB is *way* too big for DD and will harm
> >> performance. It’s going to cause slowdown vs a smaller block size,
> >> like 16M or something.
> >>
> >> There’s also limit on how fast /dev/zero can be read, especially with
> >> really large block sizes [it cannot provide 10 GiB of zeroes at a
> >> time, that’s why you had to add the “fullblock” flag, which is doing
> >> multiple reads (and writes)]. Here’s a quick sample on a system here,
> >> writing to /dev/null (so there is no real limit on the write bandwidth
> >> of the destination):
> >>
> >> dd if=/dev/zero bs=10G of=/dev/null count=1
> >> 0+1 records in
> >> 0+1 records out
> >> 2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s
> >>
> >> Notice that 1.3 GB/s, the same as your result.
> >>
> >> Try 16M instead:
> >>
> >> saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
> >> 1024+0 records in
> >> 1024+0 records out
> >> 17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s
> >>
> >> Also note that multiple dds reading from /dev/zero will run in to
> >> issues with the bandwidth of /dev/zero. /dev/zero is different than
> >> most people assume – One would think it just magically spews zeroes at
> >> any rate needed, but it’s not really designed to be read at high speed
> >> and actually isn’t that fast. If you really want to test high speed
> >> storage, you may need a tool that allocates memory and writes that
> >> out, not just dd. (ior is one example)
> >>
> >>> From: Zeeshan Ali Shah <[email protected]>
> >>> Date: Tuesday, August 28, 2018 at 9:52 AM
> >>> To: Patrick Farrell <[email protected]>
> >>> Cc: "[email protected]" <[email protected]>
> >>> Subject: Re: [lustre-discuss] separate SSD only filesystem including HDD
> >>>
> >>> 1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting
> >>> 2) time cp x x2
> >>> 3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock
> >>>
> >>> any other way to test this plz let me know
> >>>
> >>> /Zee
> >>>
> >>>> On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <[email protected]> wrote:
> >>>>
> >>>> How are you measuring write speed?
> >>>>
> >>>>> From: lustre-discuss <[email protected]> on behalf of Zeeshan Ali Shah <[email protected]>
> >>>>> Sent: Tuesday, August 28, 2018 1:30:03 AM
> >>>>> To: [email protected]
> >>>>> Subject: [lustre-discuss] separate SSD only filesystem including HDD
> >>>>>
> >>>>> Dear All, I recently deployed 10PB+ Lustre solution which is
> >>>>> working fine. Recently for genomic pipeline we acquired another
> >>>>> racks with dedicated compute nodes with single 24-NVME SSD
> >>>>> Servers/Rack . Each SSD server connected to Compute Node via 100 G
> >>>>> Omnipath.
> >>>>>
> >>>>> Issue 1: is that when I combined SSDs in stripe mode using zfs we
> >>>>> are not linearly scaling in terms of performance . for e..g single
> >>>>> SSD write speed is 1.3GB/sec , adding 5 of those in stripe mode
> >>>>> should give us atleast 1.3x5 or less , but we still get 1.3 GB out
> >>>>> of those 5 SSD .
> >>>>>
> >>>>> Issue 2: if we resolve issue #1, 2nd challenge is to allow 24 NVMEs
> >>>>> to compute nodes distributed and parallel wise , NFS not an option
> >>>>> .. tried glusterfs but due to its DHT it is slow..
> >>>>>
> >>>>> I am thinking to add another Filesystem to our existing MDT and
> >>>>> install OSTs/OSS over the NVME server.. mounting this specific ssd
> >>>>> where needed. so basically we will end up having two filesystem
> >>>>> (one with normal 10PB+ and 2nd with SSD)..
> >>>>>
> >>>>> Does this sounds correct ?
> >>>>>
> >>>>> any other advice please ..
>
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
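
For reference, here is a minimal sketch of the kind of test Patrick describes in the quoted thread: several processes at once, each writing its own file from a buffer allocated in memory rather than read from /dev/zero, using the 16M block size he suggests. The names write_file and parallel_write are hypothetical, the "fork" start method assumes a Linux client, and filling the buffer with random data is my own assumption (so that a compressing backend cannot shrink it). It is not a substitute for ior, and it runs on a single client only.

```python
import multiprocessing
import os
import tempfile
import time

BLOCK_SIZE = 16 * 1024 * 1024  # 16 MiB, the block size suggested in the thread


def write_file(job):
    """Write `count` blocks of `block_size` bytes to `path`; return bytes written."""
    path, block_size, count = job
    # Allocate the buffer once in memory instead of reading /dev/zero.
    # Random contents are an assumption, chosen so compression cannot shrink them.
    buf = os.urandom(block_size)
    written = 0
    with open(path, "wb") as f:
        for _ in range(count):
            written += f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # include the flush to storage in the timing
    return written


def parallel_write(directory, writers=4, block_size=BLOCK_SIZE, count=64):
    """Run `writers` processes, one file each; return (total_bytes, seconds)."""
    jobs = [
        (os.path.join(directory, f"w{i}.data"), block_size, count)
        for i in range(writers)
    ]
    ctx = multiprocessing.get_context("fork")  # assumes Linux
    start = time.perf_counter()
    with ctx.Pool(processes=writers) as pool:
        total = sum(pool.map(write_file, jobs))
    return total, time.perf_counter() - start
```

Calling parallel_write("/ssd", writers=8, count=256) would write eight 4 GiB files under the /ssd mount point mentioned in the thread; dividing the returned byte total by the returned seconds gives the aggregate rate for that one client. To approach the OST bandwidth limit, the same run would have to be started from multiple clients at the same time.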
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
