Hello,

If you had used "performance" or "slow" in your subject, future generations
would be able to find this thread and what it is about more easily. ^_-

Also, check the various "SSD" + "performance" threads in the ML archives.

On Fri, 20 Mar 2015 14:13:19 +0000 Rottmann Jonas wrote:

> Hi,
> 
> We have a huge write IO problem in our pre-production Ceph cluster. First,
> our hardware:
> 
You're not telling us your Ceph version, but from the tunables below I
suppose it is Firefly?
If you have the time, it would definitely be advisable to wait for Hammer
for an all-SSD cluster.
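
If you're not sure what is actually running, something like this should tell
you (output format varies a bit between releases):

  ceph --version           # the locally installed binaries
  ceph tell osd.0 version  # what a given OSD daemon is actually running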

> 4 OSD Nodes with:
> 
> Supermicro X10 Board
> 32GB DDR4 RAM
> 2x Intel Xeon E5-2620
> LSI SAS 9300-8i Host Bus Adapter
> Intel Corporation 82599EB 10-Gigabit
> 2x Intel SSDSA2CT040G3 in software raid 1 for system
> 
Nobody really knows what those inane Intel product codes are without
looking them up.
So you have two Intel 320 40GB consumer SSDs, long since EOL'ed, for the OS.
In an otherwise very modern, up-to-date system...

When you say "pre-production" cluster up there, does that mean that this
is purely a test bed, or are you planning to turn this into production
eventually?

> Disks:
> 2x Samsung EVO 840 1TB
> 
Unless you're planning to do _very_ few writes, these will wear out in
no time.
With small (4KB) IOPS you can see up to 12x write amplification with Ceph.
Consider investing in data center level SSDs like the 845 DC PRO or
comparable Intel (S3610, S3700).
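
Either way, keep an eye on the wear indicators. Assuming smartmontools knows
these drives (it should), something like

  smartctl -A /dev/sdX

will show the vendor attributes; on Samsung consumer SSDs the interesting
ones are usually Wear_Leveling_Count (177) and Total_LBAs_Written (241),
though the exact names depend on your drivedb version.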


> So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk, only
> nodiratime added)
> 
Why BTRFS?
As in, what made you feel that this was a good, safe choice?
I guess with SSDs as backing storage you at least won't have to worry
about the massive fragmentation BTRFS suffers with Ceph...
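
If you ever want to rule BTRFS out as a factor, re-preparing one OSD with XFS
is a one-liner (check the flags your ceph-disk version actually supports):

  ceph-disk prepare --fs-type xfs /dev/sdX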

> Benchmarking one disk alone gives good values:
> 
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
> 
> Fio 8k libaio depth=32:
> write: io=488184KB, bw=52782KB/s, iops=5068 , runt=  9249msec
>
And this is where you start comparing apples to oranges.
That fio run used 8KB blocks at an iodepth of 32, while the dd above wrote
1MB blocks sequentially.
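
If you want single-disk numbers you can actually compare against the cluster
later, run fio with the block size and pattern you care about, for example
something like this (against a test file, not the raw OSD device; path and
sizes are just placeholders):

  fio --name=ssd-test --filename=/mnt/test/fiofile --size=4g --direct=1 \
      --ioengine=libaio --rw=randwrite --bs=4k --iodepth=16 \
      --runtime=60 --time_based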
 
> Here our ceph.conf (pretty much standard):
> 
> [global]
> fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> mon initial members = cephasp41,ceph-monitor41
> mon host = 172.30.10.15,172.30.10.19
> public network = 172.30.10.0/24
> cluster network = 172.30.10.0/24
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> 
> #Default is 1GB, which is fine for us
> #osd journal size = {n}
> 
> #Only needed if ext4 comes to play
> #filestore xattr use omap = true
> 
> osd pool default size = 3  # Write an object n times.
> osd pool default min size = 2 # Allow writing n copies in a degraded state.
> 
Normally I'd say a replication of 2 is sufficient with SSDs, but given
your choice of SSDs I'll refrain from that.

> #Set individual per pool by a formula
> #osd pool default pg num = {n}
> #osd pool default pgp num = {n}
> #osd crush chooseleaf type = {n}
> 
> 
> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty
> good results: elapsed:    18  ops:   262144  ops/sec: 14466.30
> bytes/sec: 59253946.11
> 
Apples and oranges time again, this time you're testing with 4KB blocks and
16 threads (the defaults for this test).

Incidentally, I get this from a 3 node cluster (replication 3) with 8 OSDs
per node (SATA disk, journals on 4 Intel DC S3700 100GB) and Infiniband
(4QDR) interconnect:
elapsed:     7  ops:   246724  ops/sec: 31157.87  bytes/sec: 135599456.06
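
To make runs comparable you can also pin the rbd bench parameters explicitly
instead of relying on the defaults, along these lines (check rbd's help
output for your version for the exact flag names):

  rbd bench-write rbd/fio --io-size 4096 --io-threads 16 --io-total 1073741824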

> If I instead bench with, for example, fio using the rbd engine, I get very
> poor results:
> 
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=fio
> invalidate=0    # mandatory
> rw=randwrite
> bs=512k
> 
> [rbd_iodepth32]
> iodepth=32
> 
> RESULTS:
> write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
>
Total apples and oranges time, now you're using 512KB blocks (which of
course will reduce IOPS) at an iodepth of 32.
The bandwidth is still about the same as before, and if you multiply
105 by 128 (to compensate for the 4KB block size) you wind up with 13,440,
close to what you've seen with the rbd bench.
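As a sanity check: 105 IOPS x 512KB is about 53,760 KB/s, which matches the
bw=53896KB/s fio reports, and 53,760 KB/s / 4KB is roughly 13,440 "4KB
equivalent" IOPS versus the ~14,466 ops/sec from rbd bench-write. Same
ballpark.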
Also from where are you benching?
 
> Also if I mount the rbd with the kernel client as rbd0, format it with ext4
> and then do a dd on it, it's not that good: "dd if=/dev/zero of=tempfile
> bs=1M count=1024 conv=fdatasync,notrunc" RESULT:
> 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
> 
Mounting it where?
The same system you ran the other tests from?

Did you format it without lazy init, or wait until the lazy init finished
before doing the test?
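
If you want to take lazy init out of the picture entirely, you can disable it
at mkfs time, e.g.:

  mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/rbd0

Otherwise ext4 keeps initializing inode tables in the background after mount,
which will eat into your benchmark numbers.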

> I also tried presenting an rbd image with tgtd, mounting it on VMware
> ESXi and testing it in a VM; there I got only around 50 IOPS with 4K and
> about 25 MB/s sequential writes. With NFS the sequential read values are
> good (400 MB/s), but writes are only 25 MB/s.
>
Can't really comment on that; many things could cause it and I'm not an
expert in either tgt or NFS.
 
> What I tried tweaking so far:
>
I don't think that whatever you're seeing (aside from the apples and
oranges bit) is caused by anything you tried tweaking below.

Regards,

Christian 
> Intel NIC optimizations:
> /etc/sysctl.conf
> 
> # Increase system file descriptor limit
> fs.file-max = 65535
> 
> # Increase system IP port range to allow for more concurrent connections
> net.ipv4.ip_local_port_range = 1024 65000
> 
> # -- 10gbe tuning from Intel ixgb driver README -- #
> 
> # turn off selective ACK and timestamps
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
> 
> # memory allocation min/pressure/max.
> # read buffer, write buffer, and buffer space
> net.ipv4.tcp_rmem = 10000000 10000000 10000000
> net.ipv4.tcp_wmem = 10000000 10000000 10000000
> net.ipv4.tcp_mem = 10000000 10000000 10000000
> 
> net.core.rmem_max = 524287
> net.core.wmem_max = 524287
> net.core.rmem_default = 524287
> net.core.wmem_default = 524287
> net.core.optmem_max = 524287
> net.core.netdev_max_backlog = 300000
> 
> AND
> 
> setpci -v -d 8086:10fb e6.b=2e
> 
> 
> Setting tunables to firefly:
>             ceph osd crush tunables firefly
> 
> Setting the scheduler to noop:
>             This basically stopped IO on the cluster, and I had to
> revert it and restart some of the OSDs that had stuck requests
> 
> And I tried moving the monitor from a VM to the hardware where the OSDs
> run.
> 
> 
> Any suggestions where to look, or what could cause that problem?
> (Because I can't believe you're losing that much performance through Ceph
> replication.)
> 
> Thanks in advance.
> 
> If you need any info please tell me.
> 
> Mit freundlichen Grüßen/Kind regards
> Jonas Rottmann
> Systems Engineer
> 
> FIS-ASP Application Service Providing und
> IT-Outsourcing GmbH
> Röthleiner Weg 4
> D-97506 Grafenrheinfeld
> Phone: +49 (9723) 9188-568
> Fax: +49 (9723) 9188-600
> 
> email: j.rottm...@fis-asp.de  web: www.fis-asp.de
> 
> Geschäftsführer Robert Schuhmann
> Registergericht Schweinfurt HRB 3865


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
