Just to confirm: is there only a single NVMe device in each server node, or 
is there a single server with 24 NVMe devices in it?

Depending on what you want to use the NVMe storage for (e.g. very fast 
short-term scratch == burst buffer) it may be OK to just make a Lustre 
filesystem with each NVMe device a separate OST with no redundancy.  The 
failure rate for these devices is low, and adding redundancy will hurt 
performance.
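
As a rough sketch of that layout (one OST per NVMe device, no redundancy), the loop below only prints the format commands rather than running them. The filesystem name "nvmefs", the MGS NID and the /dev/nvmeXn1 device names are placeholders, not values from this thread, and the function name is made up for illustration.

```shell
#!/bin/sh
# Dry run: print (do not execute) one mkfs.lustre command per NVMe device.
# Arguments: filesystem name, MGS NID, number of devices (all placeholders).
print_ost_format_cmds() {
    fsname=$1
    mgs_nid=$2
    ndev=$3
    i=0
    while [ "$i" -lt "$ndev" ]; do
        # One OST per device, with no RAID or mirroring underneath.
        echo "mkfs.lustre --ost --fsname=$fsname --mgsnode=$mgs_nid --index=$i /dev/nvme${i}n1"
        i=$((i + 1))
    done
}

print_ost_format_cmds nvmefs 192.168.1.10@o2ib 24
```

Review the printed commands against your real device names before running any of them, since mkfs.lustre destroys existing data on the target device.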

Cheers, Andreas

On Aug 29, 2018, at 00:14, Zeeshan Ali Shah <[email protected]> wrote:
> 
> Thanks a lot, Patrick, for the detailed answer. I tried dd with GNU parallel 
> and the overall throughput increased locally. You are right, it is due to 
> the client-side single-thread issue.
> 
> What about the second challenge, exporting a bunch of NVMe devices from a 
> single server as a shared volume? I tried GlusterFS (very slow due to its 
> DHT). Lustre, by creating another filesystem in our existing MDT, could be an 
> option. I also tried exporting a single NVMe device over NVMe over Fabrics 
> (NVMe-oF), which looks promising, but I am looking for some kind of shared 
> volume ...
> 
> Any advice?


> On Tue, Aug 28, 2018 at 6:37 PM Patrick Farrell <[email protected]> wrote:
>> Hmm – It’s possible you’ve got an issue, but I think more likely is that 
>> your chosen benchmarks aren’t capable of showing the higher speed.
>> 
>> I’m not really sure about your fio test - writing 4K random blocks will be 
>> relatively slow and might not speed up with more disks, but I can’t speak to 
>> it in detail for fio.  I would try a much larger size and possibly more 
>> processes (is numjobs the number of concurrent processes?)…
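
On the numjobs question: fio's numjobs option creates that many clones of the job, and by default each clone runs as a separate forked process. A possible large-block sequential variant of the fio test is sketched below as a dry run that only prints the command line. The job name, the --iodepth value, the use of --direct=1 and the target directory are assumptions chosen for illustration, and the function name is hypothetical.

```shell
#!/bin/sh
# Print (do not run) a sequential-write fio command line.
# Arguments: target directory, number of jobs, block size.
print_fio_seqwrite_cmd() {
    dir=$1
    njobs=$2
    bs=$3
    echo "fio --name=seqwrite --directory=$dir --ioengine=libaio --iodepth=16 --rw=write --bs=$bs --direct=1 --size=20G --numjobs=$njobs --group_reporting"
}

print_fio_seqwrite_cmd /mnt/lustre/scratch 8 16M
```

Note that --size applies per job, so 8 jobs at 20G each write 160G in total.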
>> 
>> But I am sure about your other two:
>> Both of those tests (dd and cp) are single threaded, and if they’re running 
>> to Lustre (rather than to the ZFS volume directly), 1.3 GB/s is around the 
>> maximum expected speed.  On a recent Xeon, one process can write a maximum 
>> of about 1-1.5 GB/s to Lustre, depending on various details.  Improving disk 
>> speed won’t affect that limit for one process, it’s a client side thing.  
>> Try several processes at once, ideally from multiple clients (and definitely 
>> writing to multiple files), if you really want to see your OST bandwidth 
>> limit.
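
A minimal sketch of the "several processes at once, writing to multiple files" test on a single client, assuming a POSIX shell and GNU dd on Linux. The function name parallel_dd, the ddtest.N file names and the mount point in the example are hypothetical, not from this thread.

```shell
#!/bin/sh
# Launch several dd writers at once, each to its own file, then wait for all.
# Arguments: target directory, number of writers, block size, blocks per writer.
parallel_dd() {
    dir=$1
    nwriters=$2
    bs=$3
    count=$4
    i=0
    while [ "$i" -lt "$nwriters" ]; do
        dd if=/dev/zero of="$dir/ddtest.$i" bs="$bs" count="$count" 2>/dev/null &
        i=$((i + 1))
    done
    wait   # block until every background dd has finished
}

# Example (hypothetical mount point): 8 writers, 16 MiB blocks, 16 GiB each.
# time parallel_dd /mnt/lustre/scratch 8 16M 1024
```

Divide the total bytes written by the elapsed time to get the aggregate rate. With many writers, /dev/zero itself may become the limit, so treat the result as a lower bound on what the OSTs can do.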
>> 
>> Also, a block size of 10GB is *way* too big for dd and will harm 
>> performance.  It’s going to cause a slowdown vs. a smaller block size, like 
>> 16M or something.
>> 
>> 
>> There’s also a limit on how fast /dev/zero can be read, especially with really 
>> large block sizes [it cannot provide 10 GiB of zeroes at a time, that’s why 
>> you had to add the “fullblock” flag, which is doing multiple reads (and 
>> writes)].  Here’s a quick sample on a system here, writing to /dev/null (so 
>> there is no real limit on the write bandwidth of the destination):
>> 
>> dd if=/dev/zero bs=10G of=/dev/null count=1
>> 0+1 records in
>> 0+1 records out
>> 2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s
>> 
>> Notice that this is 1.3 GB/s, the same as your result.
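
The 2147479552 bytes in the dd output above (one partial record, "0+1") are consistent with the Linux cap on a single read() or write() call, which transfers at most 0x7ffff000 bytes. The arithmetic below assumes 4 KiB pages; the function name is made up for illustration.

```shell
#!/bin/sh
# Largest byte count a single read() or write() transfers on Linux,
# assuming 4 KiB pages: 2^31 rounded down by one page (0x7ffff000).
max_rw_count() {
    echo $(( (1 << 31) - 4096 ))
}

max_rw_count   # prints 2147479552
```

So with bs=10G and no fullblock flag, dd gets just under 2 GiB from its single read and stops after that one short record.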
>> 
>> 
>> 
>> Try 16M instead:
>> 
>> saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
>> 1024+0 records in
>> 1024+0 records out
>> 17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s
>> 
>> 
>> 
>> Also note that multiple dds reading from /dev/zero will run into issues 
>> with the bandwidth of /dev/zero.  /dev/zero is different than most people 
>> assume – One would think it just magically spews zeroes at any rate needed, 
>> but it’s not really designed to be read at high speed and actually isn’t 
>> that fast.  If you really want to test high speed storage, you may need a 
>> tool that allocates memory and writes that out, not just dd.  (ior is one 
>> example)
>> 
>> 
>> 
>>> From: Zeeshan Ali Shah <[email protected]>
>>> Date: Tuesday, August 28, 2018 at 9:52 AM
>>> To: Patrick Farrell <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Subject: Re: [lustre-discuss] separate SSD only filesystem including HDD
>>> 
>>> 1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite 
>>> --bs=4k --direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting
>>> 2) time cp x x2
>>> 3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock
>>> 
>>> If there is any other way to test this, please let me know.
>>> 
>>> /Zee
>>> 
>>> 
>>> 
>>>> On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <[email protected]> wrote:
>>>> 
>>>> How are you measuring write speed?
>>>> 
>>>> 
>>>>> From: lustre-discuss <[email protected]> on behalf 
>>>>> of Zeeshan Ali Shah <[email protected]>
>>>>> Sent: Tuesday, August 28, 2018 1:30:03 AM
>>>>> To: [email protected]
>>>>> Subject: [lustre-discuss] separate SSD only filesystem including HDD
>>>>> 
>>>>> 
>>>>> 
>>>>> Dear All, I recently deployed a 10PB+ Lustre solution, which is working 
>>>>> fine. For a genomic pipeline we have now acquired additional racks of 
>>>>> dedicated compute nodes, with a single 24-NVMe SSD server per rack. Each 
>>>>> SSD server is connected to the compute nodes via 100G Omni-Path.
>>>>> 
>>>>> 
>>>>> 
>>>>> Issue 1: when I combine SSDs in stripe mode using ZFS, performance does 
>>>>> not scale linearly. For example, a single SSD writes at 1.3 GB/s, so 
>>>>> striping 5 of them should give us close to 1.3 x 5 = 6.5 GB/s, but we 
>>>>> still get 1.3 GB/s out of those 5 SSDs.
>>>>> 
>>>>> 
>>>>> 
>>>>> Issue 2: once issue #1 is resolved, the second challenge is to make the 
>>>>> 24 NVMe devices available to the compute nodes in a distributed, parallel 
>>>>> fashion. NFS is not an option. I tried GlusterFS, but it is slow due to 
>>>>> its DHT.
>>>>> 
>>>>> 
>>>>> 
>>>>> I am thinking of adding another filesystem to our existing MDT and 
>>>>> installing the OSTs/OSS on the NVMe server, mounting this SSD filesystem 
>>>>> only where needed. So basically we will end up with two filesystems (the 
>>>>> normal 10PB+ one and a second one on SSD).
>>>>> 
>>>>> 
>>>>> Does this sound correct?
>>>>> 
>>>>> 
>>>>> 
>>>>> Any other advice, please?

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud

