Hi Tom,

Thank you very much for your hint regarding tcp_sack and sysctl network stack 
tuning. It pointed me in the right direction.

We occasionally had similar issues where reads stalled on the OSDs under high 
network load.

Enabling tcp_sack improved the situation, and some further tuning solved the 
issue completely.
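
As a minimal sketch of how this setting could be checked before further 
tuning: the Python below assumes a Linux host where the value is exposed at 
/proc/sys/net/ipv4/tcp_sack. The function name and the path parameter are 
hypothetical and only there for illustration.

```python
from pathlib import Path

# Assumed location of the setting on Linux. The path is a parameter only so
# that the function can be pointed at a different file, e.g. for testing.
TCP_SACK_PATH = Path("/proc/sys/net/ipv4/tcp_sack")


def tcp_sack_enabled(path=TCP_SACK_PATH):
    """Return True if the file at `path` holds a non-zero integer."""
    return int(Path(path).read_text().strip()) != 0


if __name__ == "__main__":
    if not TCP_SACK_PATH.exists():
        print("tcp_sack setting not found (not a Linux host?)")
    elif tcp_sack_enabled():
        print("net.ipv4.tcp_sack is enabled")
    else:
        # Re-enabling needs root; this only prints the command.
        print("net.ipv4.tcp_sack is disabled; "
              "re-enable with: sysctl -w net.ipv4.tcp_sack=1")
```

A change made with sysctl -w does not survive a reboot; a persistent setting 
would usually go into /etc/sysctl.conf or a file under /etc/sysctl.d/.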

I learned once more that Ceph needs absolutely clean and fast networking and 
that it stresses the network far more heavily than other network software.

However, I think Ceph should be designed to be more tolerant of minor network 
issues, since minor problems and a few lost packets can always happen.

Thanks
  Christoph

On Tue, Aug 02, 2016 at 07:14:27PM +0000, Helander, Thomas wrote:
> Hi David,
> 
> There’s a good amount of backstory to our configuration, but I’m happy to 
> report I found the source of my problem.
> 
> We were applying some “optimizations” for our 10GbE via sysctl, including 
> disabling net.ipv4.tcp_sack. Re-enabling net.ipv4.tcp_sack resolved the issue.
> 
> Thanks,
> Tom
> 
> From: David Turner [mailto:[email protected]]
> Sent: Monday, August 01, 2016 12:06 PM
> To: Helander, Thomas <[email protected]>; 
> [email protected]
> Subject: RE: Read Stalls with Multiple OSD Servers
> 
> Why are you running Raid 6 osds?  Ceph's usefulness is a lot of osds that can 
> fail and be replaced.  With your processors/ram, you should be running these 
> as individual osds.  That will utilize your dual processor setup much more.  
> Ceph is optimal for 1 core per osd.  Extra cores are more or less wasted in 
> the storage node.  You only have 2 storage nodes, so you can't utilize a lot 
> of the benefits of Ceph.  Your setup looks like you're much better suited for 
> a Gluster cluster instead of a Ceph cluster.  I don't know what your needs 
> are, but that's what it looks like from here.
> ________________________________
> 
> David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation<https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
> 
> ________________________________
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
> ________________________________
> ________________________________
> From: Helander, Thomas [[email protected]]
> Sent: Monday, August 01, 2016 11:10 AM
> To: David Turner; [email protected]
> Subject: RE: Read Stalls with Multiple OSD Servers
> Hi David,
> 
> Thanks for the quick response and suggestion. I do have just a basic network 
> config (one network, no VLANs) and am able to ping between the storage 
> servers using hostnames and IPs.
> 
> Thanks,
> Tom
> 
> From: David Turner [mailto:[email protected]]
> Sent: Monday, August 01, 2016 9:14 AM
> To: Helander, Thomas <[email protected]>; [email protected]
> Subject: RE: Read Stalls with Multiple OSD Servers
> 
> This could be explained by your osds not being able to communicate with each 
> other.  We have 2 vlans between our storage nodes, the public and private 
> networks for ceph to use.  We added 2 new nodes in a new rack on new switches 
> and as soon as we added a single osd for one of them to the cluster, the 
> peering never finished and we had a lot of blocked requests that never went 
> away.
> 
> In testing we found that the rest of the cluster could not communicate with 
> these nodes on the private vlan and after fixing the network switch config, 
> everything worked perfectly for adding in the 2 new nodes.
> 
> If you are using a basic network configuration with only one network and/or 
> vlan, then this is likely not to be your issue.  But to check and make sure, 
> you should test pinging between your nodes on all of the IPs they have.
> ________________________________
> From: ceph-users [[email protected]] on behalf of Helander, 
> Thomas [[email protected]]
> Sent: Monday, August 01, 2016 10:06 AM
> To: [email protected]
> Subject: [ceph-users] Read Stalls with Multiple OSD Servers
> Hi,
> 
> I’m running a three server cluster (one monitor, two OSD) and am having a 
> problem where after adding the second OSD server, my read rate drops 
> significantly and eventually the reads stall (writes are improved as 
> expected). Attached is a log of the rados benchmarks for the two 
> configurations and below is my hardware configuration. I’m not using replicas 
> (capacity is more important than uptime for our use case) and am using a 
> single 10GbE network. The pool (rbd) is configured with 128 placement groups.
> 
> I’ve checked the CPU utilization of the ceph-osd processes and they all hover 
> around 10% until the stall. After the stall, the CPU usage is 0% and the 
> disks all show zero operations via iostat. Iperf reports 9.9Gb/s between the 
> monitor and OSD servers.
> 
> I’m looking for any advice/help on how to identify the source of this issue 
> as my attempts so far have proven fruitless…
> 
> Monitor server:
> 2x E5-2680V3
> 32GB DDR4
> 2x 4TB HDD in RAID1 on an Avago/LSI 3108 with Cachevault, configured as 
> write-back
> 10GbE
> 
> OSD servers:
> 2x E5-2680V3
> 128GB DDR4
> 2x 8+2 RAID6 using 8TB SAS12 drives on an Avago/LSI 9380 controller with 
> Cachevault, configured as write-back.
>                 - Each RAID6 is an OSD
> 10GbE
> 
> Thanks,
> 
> Tom Helander
> 
> KLA-Tencor
> One Technology Dr | M/S 5-2042R | Milpitas, CA | 95035
> 
> 



> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

[email protected]     Internetloesungen vom Feinsten
Fon. +49 2166 9149-32                      Fax. +49 2166 9149-10