Hi Tom,

thank you very much for your hint regarding tcp_sack and sysctl network stack tuning. This pointed me in the right direction.

We occasionally had similar issues where reads stalled on OSDs under high network load. Enabling tcp_sack improved the situation for us, and some further tuning solved the issue completely.

I learned once more that Ceph needs absolutely clean and fast networking, and that it uses resources far more heavily than any other network software. However, I think Ceph should be designed to be more fault tolerant of minor network issues, since minor problems and a few lost packets can always happen.

Thanks
  Christoph

On Tue, Aug 02, 2016 at 07:14:27PM +0000, Helander, Thomas wrote:
> Hi David,
>
> There’s a good amount of backstory to our configuration, but I’m happy to
> report I found the source of my problem.
>
> We were applying some “optimizations” for our 10GbE via sysctl, including
> disabling net.ipv4.tcp_sack. Re-enabling net.ipv4.tcp_sack resolved the issue.
>
> Thanks,
> Tom
>
> From: David Turner [mailto:[email protected]]
> Sent: Monday, August 01, 2016 12:06 PM
> To: Helander, Thomas <[email protected]>;
> [email protected]
> Subject: RE: Read Stalls with Multiple OSD Servers
>
> Why are you running Raid 6 osds? Ceph's usefulness is a lot of osds that can
> fail and be replaced. With your processors/ram, you should be running these
> as individual osds. That will utilize your dual processor setup much more.
> Ceph is optimal for 1 core per osd. Extra cores are more or less wasted in
> the storage node. You only have 2 storage nodes, so you can't utilize a lot
> of the benefits of Ceph. Your setup looks like you're much better suited for
> a Gluster cluster instead of a Ceph cluster. I don't know what your needs
> are, but that's what it looks like from here.
>
> ________________________________
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology
> Corporation<https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> ________________________________
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this message
> is prohibited.
>
> ________________________________
> From: Helander, Thomas [[email protected]]
> Sent: Monday, August 01, 2016 11:10 AM
> To: David Turner; [email protected]
> Subject: RE: Read Stalls with Multiple OSD Servers
>
> Hi David,
>
> Thanks for the quick response and suggestion. I do have just a basic network
> config (one network, no VLANs) and am able to ping between the storage
> servers using hostnames and IPs.
>
> Thanks,
> Tom
>
> From: David Turner [mailto:[email protected]]
> Sent: Monday, August 01, 2016 9:14 AM
> To: Helander, Thomas <[email protected]>;
> [email protected]
> Subject: RE: Read Stalls with Multiple OSD Servers
>
> This could be explained by your osds not being able to communicate with each
> other. We have 2 vlans between our storage nodes, the public and private
> networks for ceph to use. We added 2 new nodes in a new rack on new switches
> and as soon as we added a single osd for one of them to the cluster, the
> peering never finished and we had a lot of blocked requests that never went
> away.
>
> In testing we found that the rest of the cluster could not communicate with
> these nodes on the private vlan and after fixing the network switch config,
> everything worked perfectly for adding in the 2 new nodes.
>
> If you are using a basic network configuration with only one network and/or
> vlan, then this is likely not to be your issue. But to check and make sure,
> you should test pinging between your nodes on all of the IPs they have.
>
> ________________________________
> From: ceph-users [[email protected]] on behalf of Helander,
> Thomas [[email protected]]
> Sent: Monday, August 01, 2016 10:06 AM
> To: [email protected]
> Subject: [ceph-users] Read Stalls with Multiple OSD Servers
>
> Hi,
>
> I’m running a three server cluster (one monitor, two OSD) and am having a
> problem where after adding the second OSD server, my read rate drops
> significantly and eventually the reads stall (writes are improved as
> expected). Attached is a log of the rados benchmarks for the two
> configurations and below is my hardware configuration. I’m not using replicas
> (capacity is more important than uptime for our use case) and am using a
> single 10GbE network. The pool (rbd) is configured with 128 placement groups.
>
> I’ve checked the CPU utilization of the ceph-osd processes and they all hover
> around 10% until the stall. After the stall, the CPU usage is 0% and the
> disks all show zero operations via iostat. Iperf reports 9.9Gb/s between the
> monitor and OSD servers.
>
> I’m looking for any advice/help on how to identify the source of this issue
> as my attempts so far have proven fruitless…
>
> Monitor server:
> 2x E5-2680V3
> 32GB DDR4
> 2x 4TB HDD in RAID1 on an Avago/LSI 3108 with Cachevault, configured as
> write-back
> 10GbE
>
> OSD servers:
> 2x E5-2680V3
> 128GB DDR4
> 2x 8+2 RAID6 using 8TB SAS12 drives on an Avago/LSI 9380 controller with
> Cachevault, configured as write-back.
> - Each RAID6 is an OSD
> 10GbE
>
> Thanks,
>
> Tom Helander
>
> KLA-Tencor
> One Technology Dr | M/S 5-2042R | Milpitas, CA | 95035
>
> CONFIDENTIALITY NOTICE: This e-mail transmission, and any documents, files or
> previous e-mail messages attached to it, may contain confidential
> information. If you are not the intended recipient, or a person responsible
> for delivering it to the intended recipient, you are hereby notified that any
> disclosure, copying, distribution or use of any of the information contained
> in or attached to this message is STRICTLY PROHIBITED. If you have received
> this transmission in error, please immediately notify us by reply e-mail at
> [email protected] or by
> telephone at (408) 875-7819, and destroy the original transmission and its
> attachments without reading them or saving them to disk. Thank you.
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Registered office: Moenchengladbach
Commercial register: Amtsgericht Moenchengladbach, HRB 6303
Managing directors: Christoph Adomeit, Hans Wilhelm Terstappen

[email protected]
Internet solutions at their finest
Fon. +49 2166 9149-32
Fax. +49 2166 9149-10
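
For anyone who finds this thread later: the fix discussed here came down to a sysctl tuning fragment that set net.ipv4.tcp_sack to 0. Below is a minimal sketch, in Python, of how one might check for that. It is an illustration under assumptions, not something taken from the posters' setups: the function names (parse_sysctl, sack_disabled, read_live_tcp_sack) and the example tuning fragment are hypothetical. The only real names are the sysctl key net.ipv4.tcp_sack and the Linux path /proc/sys/net/ipv4/tcp_sack.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key = value' lines, the format used by sysctl.conf and `sysctl -a`."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#' or ';' comments, and anything without '='.
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def sack_disabled(settings):
    """Return True only if net.ipv4.tcp_sack is explicitly set to 0."""
    return settings.get("net.ipv4.tcp_sack") == "0"


def read_live_tcp_sack(proc_root="/proc/sys"):
    """Return the running kernel's tcp_sack value as a string, or None if unreadable."""
    path = Path(proc_root) / "net" / "ipv4" / "tcp_sack"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Hypothetical tuning fragment; the values are illustrative only.
    example = """
    # 10GbE "optimizations"
    net.core.rmem_max = 16777216
    net.ipv4.tcp_sack = 0
    """
    print("config disables SACK:", sack_disabled(parse_sysctl(example)))
    print("live tcp_sack value:", read_live_tcp_sack())
```

Running this on each OSD and monitor host would show whether a tuning file or the running kernel has SACK turned off. It does not change any setting.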
