Just to confirm: is there only a single NVMe device in each server node, or is there a single server with 24 NVMe devices in it?

Depending on what you want to use the NVMe storage for (e.g. very fast short-term scratch == burst buffer) it may be OK to just make a Lustre filesystem with each NVMe device a separate OST with no redundancy. The failure rate for these devices is low, and adding redundancy will hurt performance.

Cheers, Andreas

On Aug 29, 2018, at 00:14, Zeeshan Ali Shah <[email protected]> wrote:
>
> Thanks a lot Patrick for detail answer, I tried with gnu parallel with dd and
> over all the throughput was increased locally .. you are right it is due to
> client side single thread issue.
>
> what about 2nd challenge to export bunch of NVMes from single server as
> shared volume ? I tried glusterfs (very slow due to dht) , Lustre by creating
> another filesystem in our existing MDT could be an option . I furthermore
> tried to export single NVME-over Fabric (NVMEOF) looks promising but i am
> looking to have a shared volume kind ...
>
> any advice ?
>
> On Tue, Aug 28, 2018 at 6:37 PM Patrick Farrell <[email protected]> wrote:
>> Hmm – It’s possible you’ve got an issue, but I think more likely is that
>> your chosen benchmarks aren’t capable of showing the higher speed.
>>
>> I’m not really sure about your fio test - writing 4K random blocks will be
>> relatively slow and might not speed up with more disks, but I can’t speak to
>> it in detail for fio. I would try a much larger size and possibly more
>> processes (is numjobs the number of concurrent processes?)…
>>
>> But I am sure about your other two:
>> Both of those tests (dd and cp) are single threaded, and if they’re running
>> to Lustre (rather than to the ZFS volume directly), 1.3 GB/s is around the
>> maximum expected speed. On a recent Xeon, one process can write a maximum
>> of about 1-1.5 GB/s to Lustre, depending on various details. Improving disk
>> speed won’t affect that limit for one process, it’s a client side thing.
>> Try several processes at once, ideally from multiple clients (and definitely
>> writing to multiple files), if you really want to see your OST bandwidth
>> limit.
>>
>> Also, a block size of 10GB is *way* too big for DD and will harm
>> performance. It’s going to cause slowdown vs a smaller block size, like 16M
>> or something.
>>
>> There’s also limit on how fast /dev/zero can be read, especially with really
>> large block sizes [it cannot provide 10 GiB of zeroes at a time, that’s why
>> you had to add the “fullblock” flag, which is doing multiple reads (and
>> writes)]. Here’s a quick sample on a system here, writing to /dev/null (so
>> there is no real limit on the write bandwidth of the destination):
>>
>> dd if=/dev/zero bs=10G of=/dev/null count=1
>> 0+1 records in
>> 0+1 records out
>> 2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s
>>
>> Notice that 1.3 GB/s, the same as your result.
>>
>> Try 16M instead:
>>
>> saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
>> 1024+0 records in
>> 1024+0 records out
>> 17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s
>>
>> Also note that multiple dds reading from /dev/zero will run in to issues
>> with the bandwidth of /dev/zero. /dev/zero is different than most people
>> assume – One would think it just magically spews zeroes at any rate needed,
>> but it’s not really designed to be read at high speed and actually isn’t
>> that fast. If you really want to test high speed storage, you may need a
>> tool that allocates memory and writes that out, not just dd.
>> (ior is one example)
>>
>>> From: Zeeshan Ali Shah <[email protected]>
>>> Date: Tuesday, August 28, 2018 at 9:52 AM
>>> To: Patrick Farrell <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Subject: Re: [lustre-discuss] separate SSD only filesystem including HDD
>>>
>>> 1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite
>>> --bs=4k --direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting
>>> 2) time cp x x2
>>> 3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock
>>>
>>> any other way to test this plz let me know
>>>
>>> /Zee
>>>
>>>> On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <[email protected]> wrote:
>>>>
>>>> How are you measuring write speed?
>>>>
>>>>> From: lustre-discuss <[email protected]> on behalf
>>>>> of Zeeshan Ali Shah <[email protected]>
>>>>> Sent: Tuesday, August 28, 2018 1:30:03 AM
>>>>> To: [email protected]
>>>>> Subject: [lustre-discuss] separate SSD only filesystem including HDD
>>>>>
>>>>> Dear All, I recently deployed 10PB+ Lustre solution which is working
>>>>> fine. Recently for genomic pipeline we acquired another racks with
>>>>> dedicated compute nodes with single 24-NVME SSD Servers/Rack . Each SSD
>>>>> server connected to Compute Node via 100 G Omnipath.
>>>>>
>>>>> Issue 1: is that when I combined SSDs in stripe mode using zfs we are
>>>>> not linearly scaling in terms of performance . for e..g single SSD write
>>>>> speed is 1.3GB/sec , adding 5 of those in stripe mode should give us
>>>>> atleast 1.3x5 or less , but we still get 1.3 GB out of those 5 SSD .
>>>>>
>>>>> Issue 2: if we resolve issue #1, 2nd challenge is to allow 24 NVMEs to
>>>>> compute nodes distributed and parallel wise , NFS not an option .. tried
>>>>> glusterfs but due to its DHT it is slow..
>>>>>
>>>>> I am thinking to add another Filesystem to our existing MDT and install
>>>>> OSTs/OSS over the NVME server..
>>>>> mounting this specific ssd where needed.
>>>>> so basically we will end up having two filesystem (one with normal 10PB+
>>>>> and 2nd with SSD)..
>>>>>
>>>>> Does this sounds correct ?
>>>>>
>>>>> any other advice please ..

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
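
The benchmarking advice quoted in this thread (several writer processes at once, each writing to its own file, a block size around 16M, and data taken from a buffer allocated in memory rather than read from /dev/zero) could be sketched roughly as below. This is only an illustration and not a replacement for ior: the names `run_writers` and `WORKER` and the `writer_N.dat` file names are made up for this example, the buffer is filled with random bytes so that any compression on the backend does not flatter the result, and the measured time includes interpreter startup, so it is only meaningful with sizes much larger than the small demo values used here.

```python
import os
import subprocess
import sys
import tempfile
import time

# Code run by each writer process (hypothetical helper for this sketch).
# It allocates one buffer in memory and writes it out repeatedly,
# instead of reading the data from /dev/zero.
WORKER = r"""
import os, sys
path, total, block = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
buf = memoryview(os.urandom(block))  # allocated once, reused for every write
written = 0
with open(path, "wb", buffering=0) as f:
    while written < total:
        # write() may be partial; count what was actually written
        written += f.write(buf[: min(block, total - written)])
    os.fsync(f.fileno())  # make sure the data has left the page cache
"""


def run_writers(directory, num_writers, bytes_per_writer,
                block_size=16 * 1024 * 1024):
    """Start num_writers processes, each writing its own file.

    Returns (total_bytes_on_disk, elapsed_seconds).
    """
    paths = [os.path.join(directory, f"writer_{i}.dat")
             for i in range(num_writers)]
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER,
                          path, str(bytes_per_writer), str(block_size)])
        for path in paths
    ]
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("a writer process failed")
    elapsed = time.perf_counter() - start
    total = sum(os.path.getsize(path) for path in paths)
    return total, elapsed


if __name__ == "__main__":
    # Demo against a temporary directory with small sizes; for a real test,
    # pass a directory on the filesystem under test and use far larger sizes.
    with tempfile.TemporaryDirectory() as target:
        total, elapsed = run_writers(target, num_writers=4,
                                     bytes_per_writer=32 * 1024 * 1024)
        print(f"{total / elapsed / 1e9:.2f} GB/s aggregate "
              f"({total} bytes in {elapsed:.2f} s)")
```

Running the same script from several clients at once, each against its own directory, would be closer to the "multiple clients" part of the advice; a single client may still be limited by its own network link.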
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
