Re: [lustre-discuss] separate SSD only filesystem including HDD

Patrick Farrell Tue, 28 Aug 2018 08:37:31 -0700

Hmm – It’s possible you’ve got an issue, but I think more likely is that your 
chosen benchmarks aren’t capable of showing the higher speed.


I’m not really sure about your fio test - writing 4K random blocks will be 
relatively slow and might not speed up with more disks, but I can’t speak to it 
in detail for fio.  I would try a much larger size and possibly more processes 
(is numjobs the number of concurrent processes?)…

But I am sure about your other two:
Both of those tests (dd and cp) are single threaded, and if they’re running to 
Lustre (rather than to the ZFS volume directly), 1.3 GB/s is around the maximum 
expected speed.  On a recent Xeon, one process can write a maximum of about 
1-1.5 GB/s to Lustre, depending on various details.  Improving disk speed won’t 
affect that limit for one process, it’s a client side thing.  Try several 
processes at once, ideally from multiple clients (and definitely writing to 
multiple files), if you really want to see your OST bandwidth limit.
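
A minimal sketch of the "several processes, multiple files" test on a single client, assuming GNU dd and GNU date (for `%N`). The function name `parallel_write`, its arguments, and the `ddtest.N` file names are made up for illustration; they are not from any Lustre tool:

```shell
#!/bin/sh
# Hypothetical helper: start NPROC dd writers, one output file each, then
# report the aggregate throughput over the wall-clock time of the whole run.
parallel_write() {
    target_dir=$1   # directory on the filesystem under test
    nproc=$2        # number of concurrent dd writers
    count=$3        # number of 16M blocks written by each writer

    start=$(date +%s.%N)
    i=0
    while [ "$i" -lt "$nproc" ]; do
        # conv=fsync makes dd flush its file before exiting, so the
        # timing is not just measuring the client page cache
        dd if=/dev/zero of="$target_dir/ddtest.$i" bs=16M count="$count" \
           conv=fsync 2>/dev/null &
        i=$((i + 1))
    done
    wait    # block until every background dd has finished
    end=$(date +%s.%N)

    # total MiB written divided by elapsed seconds
    awk -v n="$nproc" -v c="$count" -v s="$start" -v e="$end" 'BEGIN {
        d = e - s; if (d <= 0) d = 0.000001
        printf "%d MiB in %.2f s = %.1f MiB/s\n", n * c * 16, d, n * c * 16 / d
    }'
}
```

Something like `parallel_write /ssd 8 1024` would write 8 files of 16 GiB each. This only covers one client; running it from several clients at once is still needed to find the OST limit, and the writers all share /dev/zero as their source.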

Also, a block size of 10GB is *way* too big for dd and will harm performance.  
It’s going to cause a slowdown vs a smaller block size, like 16M or something.

There’s also a limit on how fast /dev/zero can be read, especially with really 
large block sizes [it cannot provide 10 GiB of zeroes at a time, that’s why you 
had to add the “fullblock” flag, which is doing multiple reads (and writes)].  
Here’s a quick sample on a system here, writing to /dev/null (so there is no 
real limit on the write bandwidth of the destination):
dd if=/dev/zero bs=10G of=/dev/null count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s

Notice the 1.3 GB/s: it is the same as your result.
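
The 2147479552 bytes in that output is worth a second look. On Linux, a single read() transfers at most 0x7ffff000 bytes (2 GiB minus one 4 KiB page, per the read(2) man page), which is why dd reports a partial record (0+1) instead of 10 GiB. A small arithmetic sketch, with variable names chosen only for illustration:

```shell
# Linux caps one read() at 0x7ffff000 bytes: 2 GiB minus one 4 KiB page.
max_single_read=$(( (1 << 31) - 4096 ))
# bs=10G as passed to dd, in bytes
requested=$(( 10 * 1024 * 1024 * 1024 ))
# read() calls needed to fill one 10G block with iflag=fullblock (rounded up)
reads=$(( (requested + max_single_read - 1) / max_single_read ))

echo "requested per read:  $requested"
echo "delivered per read:  $max_single_read"
echo "reads per 10G block: $reads"
```

So with bs=10G and iflag=fullblock, each "block" is really assembled from several shorter reads into one very large buffer.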

Try 16M instead:
saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s
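
To repeat this comparison on another machine, the two runs above can be generalized into a small sweep over block sizes. This is only a sketch, assuming GNU dd (which prints its summary on stderr); `zero_sweep` and its arguments are hypothetical names:

```shell
# Hypothetical helper: read TOTAL_MIB from /dev/zero into /dev/null once per
# block size (given in MiB) and print dd's summary line for each run.
zero_sweep() {
    total_mib=$1; shift
    for bs_mib in "$@"; do
        # keep the total data volume roughly constant across block sizes
        count=$(( total_mib / bs_mib ))
        printf 'bs=%sM: ' "$bs_mib"
        dd if=/dev/zero of=/dev/null bs="${bs_mib}M" count="$count" 2>&1 | tail -n 1
    done
}
```

For example, `zero_sweep 16384 1 16 256 1024` reads 16 GiB at each of four block sizes. Since nothing touches storage, the numbers are an upper bound on what dd can feed to any filesystem from /dev/zero on that client.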

Also note that multiple dds reading from /dev/zero will run into issues with 
the bandwidth of /dev/zero.  /dev/zero is different from what most people 
assume: one would think it just magically spews zeroes at any rate needed, but 
it’s not really designed to be read at high speed and actually isn’t that fast.  
If you really want to test high speed storage, you may need a tool that 
allocates memory and writes that out, not just dd.  (ior is one example.)

From: Zeeshan Ali Shah <[email protected]>
Date: Tuesday, August 28, 2018 at 9:52 AM
To: Patrick Farrell <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [lustre-discuss] separate SSD only filesystem including HDD

1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
--direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting

2) time cp x x2

3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock

If there is any other way to test this, please let me know.

/Zee



On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <[email protected]> wrote:
How are you measuring write speed?

________________________________
From: lustre-discuss <[email protected]> on behalf of Zeeshan Ali Shah 
<[email protected]>
Sent: Tuesday, August 28, 2018 1:30:03 AM
To: [email protected]
Subject: [lustre-discuss] separate SSD only filesystem including HDD

Dear All, I recently deployed a 10PB+ Lustre solution, which is working fine. 
Recently, for a genomic pipeline, we acquired additional racks with dedicated 
compute nodes and a single 24-NVMe SSD server per rack. Each SSD server is 
connected to the compute nodes via 100G Omni-Path.

Issue 1: when I combine SSDs in stripe mode using ZFS, performance does not 
scale linearly. For example, a single SSD's write speed is 1.3 GB/s; adding 5 
of those in stripe mode should give us about 1.3 x 5 GB/s or a little less, but 
we still get 1.3 GB/s out of those 5 SSDs.

Issue 2: if we resolve issue #1, the second challenge is to make the 24 NVMe 
drives available to the compute nodes in a distributed, parallel way. NFS is 
not an option. I tried GlusterFS, but due to its DHT it is slow.

I am thinking of adding another filesystem to our existing MDT and installing 
OSTs/OSS on the NVMe servers, mounting this SSD filesystem where needed. So 
basically we will end up with two filesystems (one with the normal 10PB+ and a 
second with SSD).

Does this sound correct?

Any other advice, please.


/Zeeshan

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
