Hi,
        I switched to using only the InfiniBand network. For 4KB writes, the 
IOPS didn't improve much. I also logged into the OSD nodes, and atop showed that 
the disks are not always 100% busy. Here is a snapshot from one node:

DSK |          sdc  | busy     72% |  read    20/s |  write   86/s | KiB/w     13  | MBr/s   0.16 |  MBw/s   1.12 |  avio 6.69 ms |
DSK |          sda  | busy     47% |  read     0/s |  write  589/s | KiB/w      4  | MBr/s   0.00 |  MBw/s   2.83 |  avio 0.79 ms |
DSK |          sdb  | busy     31% |  read    14/s |  write   77/s | KiB/w     10  | MBr/s   0.11 |  MBw/s   0.76 |  avio 3.42 ms |
DSK |          sdd  | busy     19% |  read     4/s |  write   50/s | KiB/w     11  | MBr/s   0.03 |  MBw/s   0.55 |  avio 3.40 ms |
NET | transport     | tcpi   656/s |  tcpo   655/s |  udpi     0/s | udpo     0/s  | tcpao    0/s |  tcppo    0/s |  tcprs    0/s |
NET | network       | ipi    657/s |  ipo    655/s |  ipfrw    0/s | deliv  657/s  |              |  icmpi    0/s |  icmpo    0/s |
NET | p10p1     0%  | pcki     0/s |  pcko     0/s |  si    0 Kbps | so    1 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
NET | ib0     ----  | pcki   637/s |  pcko   636/s |  si 8006 Kbps | so 5213 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
NET | lo      ----  | pcki    19/s |  pcko    19/s |  si   14 Kbps | so   14 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
        
        /dev/sda is the OS and journaling SSD. The other three are OSDs.
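
        For comparing nodes, the DSK lines of a snapshot like the one above can be
        summarized with a small script. This is only a rough sketch: the helper name
        parse_dsk_lines is made up for illustration, the sample text is the snapshot
        above with wrapped lines rejoined and spacing collapsed, and the parsing assumes
        atop keeps the "label value" layout inside each "|"-separated field.

```python
import re

# DSK lines from the snapshot in this message (wrapped lines rejoined,
# spacing collapsed). sda is the OS/journal SSD, the others are OSDs.
ATOP_SNAPSHOT = """\
DSK | sdc | busy 72% | read 20/s | write 86/s | KiB/w 13 | MBr/s 0.16 | MBw/s 1.12 | avio 6.69 ms |
DSK | sda | busy 47% | read 0/s | write 589/s | KiB/w 4 | MBr/s 0.00 | MBw/s 2.83 | avio 0.79 ms |
DSK | sdb | busy 31% | read 14/s | write 77/s | KiB/w 10 | MBr/s 0.11 | MBw/s 0.76 | avio 3.42 ms |
DSK | sdd | busy 19% | read 4/s | write 50/s | KiB/w 11 | MBr/s 0.03 | MBw/s 0.55 | avio 3.40 ms |
"""


def parse_dsk_lines(text):
    """Return {disk_name: {label: numeric_value}} for every DSK line.

    Hypothetical helper, not part of atop. It assumes each field after
    the disk name looks like "<label> <number><optional unit>".
    """
    disks = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or fields[0] != "DSK":
            continue
        stats = {}
        for field in fields[2:]:
            match = re.match(r"(\S+)\s+([\d.]+)", field)
            if match:
                stats[match.group(1)] = float(match.group(2))
        disks[fields[1]] = stats
    return disks


disks = parse_dsk_lines(ATOP_SNAPSHOT)
for name, stats in sorted(disks.items()):
    total_iops = stats["read"] + stats["write"]
    print(f"{name}: busy {stats['busy']:.0f}%, "
          f"{total_iops:.0f} IOPS, avio {stats['avio']:.2f} ms")
```

        None of the four disks reaches 100% busy in this sample, which matches what I
        saw in atop.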

        Am I missing anything?

        Thanks,

        

        
Zhang, Di
Postdoctoral Associate
Baylor College of Medicine

> On Jul 13, 2016, at 6:56 PM, Christian Balzer <[email protected]> wrote:
> 
> 
> Hello,
> 
> On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> 
>> I also tried 4K write bench. The IOPS is ~420. 
> 
> That's what people usually mean (4KB blocks) when talking about IOPS.
> This number is pretty low; my guess would be network latency on your 1Gb/s
> network for the most part.
> 
> You should run atop on your storage nodes while running a test like this
> and see if the OSDs (HDDs) are also very busy.
> 
> Lastly the rados bench gives you some basic numbers but it is not the same
> as real client I/O, for that you want to run fio inside a VM or in your
> case on a mounted CephFS.
> 
>> I used to have better
>> bandwidth when I use the same network for both the cluster and clients. Now
>> the bandwidth must be limited by the 1G ethernet. 
> That's the bandwidth you also see in your 4MB block tests below.
> For small I/Os the real killer is latency, though.
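
The link between latency and IOPS can be checked against the rados bench numbers quoted further down. The sketch below is a back-of-the-envelope calculation, not anything from the Ceph tools: it applies Little's law (operations in flight = IOPS x average latency), the function names are made up for illustration, and the 16 in-flight operations for the 4K run is an assumption based on the rados bench default concurrency, since the exact 4K command was not posted.

```python
def implied_concurrency(iops, avg_latency_s):
    """Little's law: average operations in flight = throughput * latency."""
    return iops * avg_latency_s


def implied_latency_s(iops, concurrency):
    """Little's law rearranged: average latency = in-flight ops / throughput."""
    return concurrency / iops


def bandwidth_mb_per_s(ops, op_size_bytes, seconds):
    """Bandwidth in MB/s (2**20 bytes) from op count, op size and run time."""
    return ops * op_size_bytes / seconds / 2**20


# 4 MB write bench from the thread: 27 IOPS at 0.579105 s average latency.
in_flight = implied_concurrency(27, 0.579105)

# Same run: 288 writes of 4194304 bytes in 10.480383 s.
write_bw = bandwidth_mb_per_s(288, 4194304, 10.480383)

# 4K write bench: ~420 IOPS. ASSUMPTION: 16 concurrent ops (rados bench default).
latency_4k = implied_latency_s(420, 16)

print(f"4 MB run: about {in_flight:.1f} ops in flight, {write_bw:.2f} MB/s")
print(f"4K run:   about {latency_4k * 1000:.0f} ms average latency per write")
```

With 4 MB objects the run is limited by bandwidth on the 1Gb/s link. With 4K writes the per-operation latency sets the ceiling, so at a fixed queue depth the IOPS can only rise if that latency drops.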
> 
>> What would you suggest to
>> me to do?
>> 
> That depends on your budget mostly (switch ports, client NICs).
> 
> A uniform, single 10Gb/s network would be better in all aspects than the
> split network you have now.
> 
> Christian
> 
>> Thanks,
>> 
>> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <[email protected]> wrote:
>> 
>>> Hello,
>>>    Sorry for the misunderstanding about IOPS. Here are some summary stats
>>> of my benchmark (Does 20 - 30 IOPS seem normal to you?):
>>> 
>>> ceph osd pool create test 512 512
>>> 
>>> rados bench -p test 10 write --no-cleanup
>>> 
>>> Total time run:         10.480383
>>> Total writes made:      288
>>> Write size:             4194304
>>> Object size:            4194304
>>> Bandwidth (MB/sec):     109.92
>>> Stddev Bandwidth:       11.9926
>>> Max bandwidth (MB/sec): 124
>>> Min bandwidth (MB/sec): 80
>>> Average IOPS:           27
>>> Stddev IOPS:            3
>>> Max IOPS:               31
>>> Min IOPS:               20
>>> Average Latency(s):     0.579105
>>> Stddev Latency(s):      0.19902
>>> Max latency(s):         1.32831
>>> Min latency(s):         0.245505
>>> 
>>> rados bench -p bench -p test 10 seq
>>> Total time run:       10.340724
>>> Total reads made:     288
>>> Read size:            4194304
>>> Object size:          4194304
>>> Bandwidth (MB/sec):   111.404
>>> Average IOPS          27
>>> Stddev IOPS:          2
>>> Max IOPS:             31
>>> Min IOPS:             22
>>> Average Latency(s):   0.564858
>>> Max latency(s):       1.65278
>>> Min latency(s):       0.141504
>>> 
>>> rados bench -p bench -p test 10 rand
>>> Total time run:       10.546251
>>> Total reads made:     293
>>> Read size:            4194304
>>> Object size:          4194304
>>> Bandwidth (MB/sec):   111.13
>>> Average IOPS:         27
>>> Stddev IOPS:          2
>>> Max IOPS:             32
>>> Min IOPS:             24
>>> Average Latency(s):   0.57092
>>> Max latency(s):       1.8631
>>> Min latency(s):       0.161936
>>> 
>>> 
>>> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <[email protected]> wrote:
>>> 
>>>> 
>>>> Hello,
>>>> 
>>>> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
>>>> 
>>>>> I am using 10G infiniband for cluster network and 1G ethernet for
>>>>> public.
>>>> Hmm, very unbalanced, but I guess that's HW you already had.
>>>> 
>>>>> Because I don't have enough slots on the node, I am using three files
>>>>> on the OS drive (SSD) for journaling, which really improved but did not
>>>>> entirely solve the problem.
>>>>> 
>>>> If you can, use partitions instead of files, less overhead.
>>>> What model SSD is that?
>>>> 
>>>> Also putting the meta-data pool on SSDs might help.
>>>> 
>>>>> I am quite happy with the current IOPS, which range from 200 MB/s to 400
>>>>> MB/s sequential write, depending on the block size.
>>>> That's not IOPS, that's bandwidth, throughput.
>>>> 
>>>>> But the problem is,
>>>>> when I transfer data to the cephfs at a rate below 100MB/s, I can
>>>>> observe the slow/blocked requests warnings after a few minutes via
>>>>> "ceph -w".
>>>> 
>>>> I doubt the speed has anything to do with this; it is more likely the
>>>> actual block size and IOPS numbers.
>>>> 
>>>> As always, watch your storage nodes with atop (or iostat) during such
>>>> scenarios/tests and spot the bottlenecks.
>>>> 
>>>>> It's
>>>>> not specific to any particular OSDs. So I started to doubt if the
>>>>> configuration is correct or upgrading to Jewel can solve it.
>>>>> 
>>>> Jewel is likely to help in general, but can't fix insufficient HW or
>>>> broken configurations.
>>>> 
>>>>> There are about 5,000,000 objects currently in the cluster.
>>>>> 
>>>> You're probably not hitting this, but read the recent filestore merge and
>>>> split threads, including the entirety of this thread:
>>>> https://www.mail-archive.com/[email protected]/msg29243.html
>>>> 
>>>> Christian
>>>> 
>>>>> Thanks for the hints.
>>>>> 
>>>>> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <[email protected]> wrote:
>>>>> 
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
>>>>>> 
>>>>>>> It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both
>>>>>>> cephfs_data and cephfs_metadata. I experienced some slow/blocked request
>>>>>>> issues when I was using hammer 0.94.x and prior. So I was wondering whether
>>>>>>> the pg_num is too large for metadata.
>>>>>> 
>>>>>> Very, VERY much doubt this.
>>>>>> 
>>>>>> Your "ideal" values for a cluster of this size (are you planning to grow
>>>>>> it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.
>>>>>> 
>>>>>> Not really that far off and more importantly not overloading the OSDs with
>>>>>> too many PGs in total. Or do you have more pools?
>>>>>> 
>>>>>> 
>>>>>>> I just upgraded the cluster to Jewel
>>>>>>> today. Will watch if the problem remains.
>>>>>>> 
>>>>>> Jewel improvements might mask things, but I'd venture that your problems
>>>>>> were caused by your HW not being sufficient for the load.
>>>>>> 
>>>>>> As in, do you use SSD journals, etc?
>>>>>> How many IOPS do you need/expect from your CephFS?
>>>>>> How many objects are in there?
>>>>>> 
>>>>>> Christian
>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <[email protected]> wrote:
>>>>>>> 
>>>>>>>> I'm not at all sure that rados cppool actually captures everything (it
>>>>>>>> might). Doug has been working on some similar stuff for disaster
>>>>>>>> recovery testing and can probably walk you through moving over.
>>>>>>>> 
>>>>>>>> But just how large *is* your metadata pool in relation to others?
>>>>>>>> Having a too-large pool doesn't cost much unless it's
>>>>>>>> grossly-inflated, and having a nice distribution of your folders is
>>>>>>>> definitely better than not.
>>>>>>>> -Greg
>>>>>>>> 
>>>>>>>> On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>>    Is there any way to change the metadata pool for a cephfs without
>>>>>>>>> losing any existing data? I know how to clone the metadata pool using
>>>>>>>>> rados cppool. But the filesystem still links to the original metadata
>>>>>>>>> pool no matter what you name it.
>>>>>>>>> 
>>>>>>>>>    The motivation here is to decrease the pg_num of the metadata pool.
>>>>>>>>> I created this cephfs cluster some time ago, when I didn't realize that
>>>>>>>>> I shouldn't assign a large pg_num to such a small pool.
>>>>>>>>> 
>>>>>>>>>    I'm not sure if I can delete the fs and re-create it using the
>>>>>>>>> existing data pool and the cloned metadata pool.
>>>>>>>>> 
>>>>>>>>>    Thank you.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Zhang Di
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>> [email protected]           Global OnLine Japan/Rakuten Communications
>>>>>> http://www.gol.com/
>>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> [email protected]           Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>>> 
>>> 
>>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> [email protected]           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com