Hello!
On Fri, Oct 30, 2015 at 09:30:40PM +0000, moloney wrote:
> Hi,
> I recently got my first Ceph cluster up and running and have been doing some
> stress tests. I quickly found that during sequential write benchmarks the
> throughput would often drop to zero. Initially I saw this inside QEMU virtual
> machines, but I can also reproduce the issue with "rados bench" within 5-10
> minutes of sustained writes. If left alone the writes will eventually start
> going again, but it takes quite a while (at least a couple minutes). If I
> stop and restart the benchmark the write throughput will immediately be where
> it is supposed to be.
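For reference, a sustained-write reproduction along these lines with rados bench looks roughly like the following (the pool name, PG count, and thread count are just example choices, not what Brendan used):

```shell
# Create a throwaway benchmark pool (PG count is a guess for a small cluster).
ceph osd pool create bench 128 128

# Sustained sequential writes: 600 seconds, 16 concurrent ops.
# --no-cleanup keeps the objects so a read benchmark can follow.
rados bench -p bench 600 write -t 16 --no-cleanup
```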
> I have convinced myself it is not a network hardware issue. I can load up
> the network with a bunch of parallel iperf benchmarks and it keeps chugging
> along happily. When the issue occurs with Ceph I don't see any indications of
> network issues (e.g. dropped packets). Adding additional network load during
> the rados bench (using iperf) doesn't seem to trigger the issue any faster or
> more often.
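A parallel iperf load test like the one described can be sketched as below (the addresses, duration, and stream count are placeholders; run `iperf -s` on each target host first):

```shell
# From the client, run several parallel streams against different OSD hosts
# at once to load the network alongside the rados bench.
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
    iperf -c "$host" -t 300 -P 4 &
done
wait
```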
> I have also convinced myself it isn't an issue with a journal getting full or
> an OSD being too busy. The amount of data being written before the problem
> occurs is much larger than the total journal capacity. Watching the load on
> the OSD servers with top/iostat I don't see anything being overloaded,
> rather I see the load everywhere drop to essentially zero when the writes
> stall. Before the writes stall the load is well distributed with no visible
> hot spots. The OSDs and hosts that report slow requests are random, so I
> don't think it is a failing disk or server. I don't see anything interesting
> going on in the logs so far (I am just about to do some tests with Ceph's
> debug logging cranked up).
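If it helps, OSD and messenger debug levels can be raised at runtime with injectargs; the levels shown here are just common choices for chasing stalled requests, not a recommendation from the original post:

```shell
# Raise messenger and OSD debug logging on all OSDs at runtime.
ceph tell osd.* injectargs '--debug_ms 1 --debug_osd 20'

# Drop the levels back down once the stall has been captured,
# since debug_osd 20 is very verbose.
ceph tell osd.* injectargs '--debug_ms 0 --debug_osd 1'
```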
> The cluster specs are:
> OS: Ubuntu 14.04 with 3.16 kernel
> Ceph: 9.1.0
> OSD Filesystem: XFS
> Replication: 3X
> Two racks with IPoIB network
> 10Gbps Ethernet between racks
> 8 OSD servers with:
> * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
> * 128GB RAM
> * 12 6TB Seagate drives (connected to LSI 2208 chip in JBOD mode)
> * Two 400GB Intel P3600 NVMe drives (OS on RAID1 partition, 6 partitions
> for OSD journals each)
> * Mellanox ConnectX-3 NIC (for both Infiniband and 10Gbps Ethernet)
> 3 MONs co-located on OSD servers
> Any advice is greatly appreciated. I am planning to try this with Hammer too.
I had the same trouble with Hammer, Ubuntu 14.04, and the 3.19 kernel on Supermicro
X9DRL-3F/iF boards with Intel 82599ES NICs, bonded into one link to two different Cisco
Nexus 5020 switches. It was finally fixed by dropping the MTU from 9000 down to 1500.
It worked with MTU 9000 and the following sysctls for a while, but after several weeks
the trouble came back and I had to drop the MTU again:
net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 150000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
vm.swappiness = 1
net.ipv4.tcp_moderate_rcvbuf = 0
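For completeness, the runtime side of the MTU change is just the following (the interface name is an example; on 14.04, make it persistent via /etc/network/interfaces):

```shell
# Drop the interface MTU back to 1500 at runtime and verify it took effect.
ip link set dev eth0 mtu 1500
ip link show dev eth0

# The sysctls above go in /etc/sysctl.conf (or a file under /etc/sysctl.d/)
# and are applied with:
sysctl -p
```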
> Thanks,
> Brendan
--
WBR, Max A. Krasilnikov
ColoCall Data Center
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com