Thank you for your comment. I know what the oflag=direct option means and other
things about stress testing.
Unfortunately the speed is very slow for this cluster FS.
Here is the same test on another cluster FS (GPFS) that consists of 4 disks:
# dd if=/dev/zero|pv|dd oflag=direct of=99999 bs=4k count=10k
40.1MB 0:00:05 [7.57MB/s] [ <=> ]
10240+0 records in
10240+0 records out
41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
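(In rough per-operation terms: 7.9 MB/s at 4 KB per write is about 1900
synchronous writes per second, i.e. roughly 0.5 ms per write, so GPFS handles
the same small direct writes roughly five times faster per operation.)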
I hope I am simply missing some configuration options or something else.
--
Best Regards,
Stanislav Butkeev
15.10.2015, 22:36, "John Spray" <[email protected]>:
> On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <[email protected]> wrote:
>> Hello John
>>
>> Yes, of course, the write speed rises, because we are increasing the amount
>> of data per disk operation.
>> But do you know of even one piece of software that writes data in 1 MB
>> blocks? I don't, and I suspect you don't either.
>
> Plenty of applications do large writes, especially if they're intended
> for use on network filesystems.
>
> When you pass oflag=direct, you are asking the kernel to send these
> writes individually instead of aggregating them in the page cache.
> What you're measuring here is effectively the issue rate of small
> messages, rather than the speed at which data can be written to ceph.
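>
> (Rough numbers: 1.6 MB/s at 4 KB per write is about 400 synchronous
> writes per second, i.e. roughly 2.5 ms per round trip, which is in the
> range you would expect for a small synchronous write over a network
> filesystem.)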
>
> Try the same benchmark with NFS; you'll see similar scaling with block size.
>
> Cheers,
> John
>
> If you want to aggregate these writes in the page cache before sending
> them over the network, I imagine you probably need to disable direct
> IO.
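>
> For example (untested, just to illustrate the idea): the same run
> without oflag=direct, but with conv=fdatasync so the data is still
> flushed before dd reports a rate:
>
> dd if=/dev/zero of=testfile bs=4k count=10k conv=fdatasync
>
> That should let the client batch the 4k writes into much larger
> operations before they go over the network.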
>
>> Simple test: dd to common 2Tb SATA disk
>>
>> # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>> 4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>
>> # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
>> dd: warning: partial read (24576 bytes); suggest iflag=fullblock
>> 319MiB 0:00:03 [ 103MiB/s] [ <=> ]
>> 10219+21 records in
>> 10219+21 records out
>> 335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s
>>
>> A single SATA disk has a better rate than cephfs, which consists of 24 of
>> the same disks.
>>
>> --
>> Best Regards,
>> Stanislav Butkeev
>>
>> 15.10.2015, 21:49, "John Spray" <[email protected]>:
>>> On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <[email protected]> wrote:
>>>> Hello all,
>>>> Has anybody tried using cephfs?
>>>>
>>>> I have two servers running RHEL 7.1 (latest kernel
>>>> 3.10.0-229.14.1.el7.x86_64). Each server has 15 GB of flash for the ceph
>>>> journal and 12*2 TB SATA disks for data.
>>>> I have an Infiniband (IPoIB) 56 Gb/s interconnect between the nodes.
>>>>
>>>> Cluster version
>>>> # ceph -v
>>>> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>>
>>>> Cluster config
>>>> # cat /etc/ceph/ceph.conf
>>>> [global]
>>>> auth service required = cephx
>>>> auth client required = cephx
>>>> auth cluster required = cephx
>>>> fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>> mon osd full ratio = .95
>>>> mon osd nearfull ratio = .90
>>>> osd pool default size = 2
>>>> osd pool default min size = 1
>>>> osd pool default pg num = 32
>>>> osd pool default pgp num = 32
>>>> max open files = 131072
>>>> osd crush chooseleaf type = 1
>>>> [mds]
>>>>
>>>> [mds.a]
>>>> host = ak34
>>>>
>>>> [mon]
>>>> mon_initial_members = a,b
>>>>
>>>> [mon.a]
>>>> host = ak34
>>>> mon addr = 172.24.32.134:6789
>>>>
>>>> [mon.b]
>>>> host = ak35
>>>> mon addr = 172.24.32.135:6789
>>>>
>>>> [osd]
>>>> osd journal size = 1000
>>>>
>>>> [osd.0]
>>>> osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>>>> host = ak34
>>>> public addr = 172.24.32.134
>>>> osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
>>>> .....
>>>>
>>>> Cluster tree
>>>> # ceph osd tree
>>>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>> -1 45.75037 root default
>>>> -2 45.75037 region RU
>>>> -3 45.75037 datacenter ru-msk-ak48t
>>>> -4 22.87518 host ak34
>>>> 0 1.90627 osd.0 up 1.00000 1.00000
>>>> 1 1.90627 osd.1 up 1.00000 1.00000
>>>> 2 1.90627 osd.2 up 1.00000 1.00000
>>>> 3 1.90627 osd.3 up 1.00000 1.00000
>>>> 4 1.90627 osd.4 up 1.00000 1.00000
>>>> 5 1.90627 osd.5 up 1.00000 1.00000
>>>> 6 1.90627 osd.6 up 1.00000 1.00000
>>>> 7 1.90627 osd.7 up 1.00000 1.00000
>>>> 8 1.90627 osd.8 up 1.00000 1.00000
>>>> 9 1.90627 osd.9 up 1.00000 1.00000
>>>> 10 1.90627 osd.10 up 1.00000 1.00000
>>>> 11 1.90627 osd.11 up 1.00000 1.00000
>>>> -5 22.87518 host ak35
>>>> 12 1.90627 osd.12 up 1.00000 1.00000
>>>> 13 1.90627 osd.13 up 1.00000 1.00000
>>>> 14 1.90627 osd.14 up 1.00000 1.00000
>>>> 15 1.90627 osd.15 up 1.00000 1.00000
>>>> 16 1.90627 osd.16 up 1.00000 1.00000
>>>> 17 1.90627 osd.17 up 1.00000 1.00000
>>>> 18 1.90627 osd.18 up 1.00000 1.00000
>>>> 19 1.90627 osd.19 up 1.00000 1.00000
>>>> 20 1.90627 osd.20 up 1.00000 1.00000
>>>> 21 1.90627 osd.21 up 1.00000 1.00000
>>>> 22 1.90627 osd.22 up 1.00000 1.00000
>>>> 23 1.90627 osd.23 up 1.00000 1.00000
>>>>
>>>> Cluster status
>>>> # ceph -s
>>>> cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>> health HEALTH_OK
>>>> monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
>>>> election epoch 10, quorum 0,1 a,b
>>>> mdsmap e14: 1/1/1 up {0=a=up:active}
>>>> osdmap e194: 24 osds: 24 up, 24 in
>>>> pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
>>>> 545 GB used, 44132 GB / 44678 GB avail
>>>> 384 active+clean
>>>>
>>>> Pools for cephfs
>>>> # ceph osd dump|grep pg
>>>> pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0
>>>> object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags
>>>> hashpspool crash_replay_interval 45 stripe_width 0
>>>> pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0
>>>> object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool
>>>> stripe_width 0
>>>>
>>>> Rados bench
>>>> # rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p
>>>> cephfs_data 300 seq
>>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds
>>>> or 0 objects
>>>> Object prefix: benchmark_data_XXXXXXXXXXXXXXXXXXXX_8108
>>>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>>>> 0 0 0 0 0 0 - 0
>>>> 1 16 170 154 615.74 616 0.109984 0.0978277
>>>> 2 16 335 319 637.817 660 0.0623079 0.0985001
>>>> 3 16 496 480 639.852 644 0.0992808 0.0982317
>>>> 4 16 662 646 645.862 664 0.0683485 0.0980203
>>>> 5 16 831 815 651.796 676 0.0773545 0.0973635
>>>> 6 15 994 979 652.479 656 0.112323 0.096901
>>>> 7 16 1164 1148 655.826 676 0.107592 0.0969845
>>>> 8 16 1327 1311 655.335 652 0.0960067 0.0968445
>>>> 9 16 1488 1472 654.066 644 0.0780589 0.0970879
>>>>
>>>> .....
>>>> 297 16 43445 43429 584.811 596 0.0569516 0.109399
>>>> 298 16 43601 43585 584.942 624 0.0707439 0.109388
>>>> 299 16 43756 43740 585.059 620 0.20408 0.109363
>>>> 2015-10-15 14:16:59.622610min lat: 0.0109677 max lat: 0.951389 avg lat:
>>>> 0.109344
>>>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>>>> 300 13 43901 43888 585.082 592 0.0768806 0.109344
>>>> Total time run: 300.329089
>>>> Total reads made: 43901
>>>> Read size: 4194304
>>>> Bandwidth (MB/sec): 584.705
>>>>
>>>> Average Latency: 0.109407
>>>> Max latency: 0.951389
>>>> Min latency: 0.0109677
>>>>
>>>> But the real write speed is very low:
>>>>
>>>> # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=4k count=10k
>>>> 10240+0 records in1.5MiB/s] [ <=> ]
>>>> 10240+0 records out
>>>> 41943040 bytes (42 MB) copied, 25.9155 s, 1.6 MB/s
>>>> 40.1MiB 0:00:25 [1.55MiB/s] [ <=> ]
>>>>
>>>> # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=32k count=10k
>>>> 10240+0 records in0.5MiB/s] [ <=> ]
>>>> 10240+0 records out
>>>> 335544320 bytes (336 MB) copied, 28.2998 s, 11.9 MB/s
>>>> 320MiB 0:00:28 [11.3MiB/s] [ <=> ]
>>>
>>> So what happens if you continue increasing the 'bs' parameter? Is
>>> bs=1M nice and fast?
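>>>
>>> For instance, something along these lines (reusing the file name from
>>> your test):
>>>
>>> dd if=/dev/zero of=44444 bs=1M count=400 oflag=direct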
>>>
>>> John
>>>
>>>> Do you know the root cause of the low write speed to the FS?
>>>>
>>>> Thank you in advance for your help!
>>>>
>>>> --
>>>> Best Regards,
>>>> Stanislav Butkeev
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com