Re: [ceph-users] NFS interaction with RBD
Hi George,

Well, that's strange. I wonder why our systems behave so differently. We've got:

- hypervisors running on Ubuntu 14.04
- VMs with 9 Ceph volumes, 2 TB each
- XFS instead of your ext4

Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our Ceph cluster; I'm about to leave on vacation and don't have time to look that up any more.

Best regards,
Christian

On 29 May 2015, at 14:42, Georgios Dimitrakakis <gior...@acmac.uoc.gr> wrote:

All,

I've tried to recreate the issue, without success. My configuration is the following:

OS (hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
Ceph: version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x 4TB OSDs equally distributed across two disk nodes, 3x monitors

OpenStack Cinder has been configured to provide RBD volumes from Ceph. I created 10x 500GB volumes, all attached to a single virtual machine. For comparison, every volume was formatted twice, once with mkfs.xfs and once with mkfs.ext4. I tried to issue the commands all at the same time (or as close to that as possible). In neither test did I notice any interruption. It may have taken longer than doing them one at a time, but the system stayed up throughout and everything kept responding without problems. While these processes were running there were 100 open connections to one of the OSD nodes and 111 to the other.

So I guess I am not experiencing the issue because of the low number of OSDs I have. Is my assumption correct?

Best regards,

George

Thanks a million for the feedback Christian!

I've tried to recreate the issue with 10 RBD volumes mounted on a single server, without success. I issued the mkfs.xfs commands simultaneously (or at least as fast as I could in different terminals) without noticing any problems.

Can you please tell me the size of each of your RBD volumes? I have a feeling that mine were too small, and if so I have to test it on our bigger cluster. I've also thought that besides the QEMU version the underlying OS might matter, so what was your testbed?

All the best,

George
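George's test above boils down to formatting every attached volume in parallel and watching the hypervisor's connection count. A minimal sketch, assuming the ten volumes appear in the guest as /dev/vdb through /dev/vdk (hypothetical names) and that ss is run as root on the hypervisor:

    # Inside the guest: format all ten RBD volumes in parallel.
    # Device names are assumptions -- adjust to the attached volumes.
    for dev in /dev/vd{b..k}; do
        mkfs.xfs -f "$dev" &
    done
    wait

    # On the hypervisor: count the QEMU process's TCP connections per peer
    # address (i.e. per OSD node) to see how close it gets to the FD limit.
    ss -tnp | grep qemu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c

With 10 volumes against 20 OSDs, one connection per OSD per volume (the behaviour diagnosed later in this thread) gives at most roughly 200 connections, which matches the 100 + 111 George observed and stays comfortably below the default 1024-FD limit -- consistent with his assumption that the cluster is too small to trigger the problem.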
Re: [ceph-users] NFS interaction with RBD
Hi George,

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was:

    ~ ; cat /proc/183016/limits
    ...
    Max open files            1024                 4096                 files
    ...

This can be changed by setting a decent value for max_files in /etc/libvirt/qemu.conf (a configuration sketch follows below).

Regards,
Christian

On 27 May 2015, at 16:23, Jens-Christian Fischer <jens-christian.fisc...@switch.ch> wrote:

George, I will let Christian provide you the details. As far as I know, it was enough to just do an 'ls' on all of the attached drives.

We are using QEMU 2.0:

    $ dpkg -l | grep qemu
    ii  ipxe-qemu           1.0.0+git-2013.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
    ii  qemu-keymaps        2.0.0+dfsg-2ubuntu1.11           all    QEMU keyboard maps
    ii  qemu-system         2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries
    ii  qemu-system-arm     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (arm)
    ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (common files)
    ii  qemu-system-mips    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (mips)
    ii  qemu-system-misc    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (miscellaneous)
    ii  qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (ppc)
    ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (sparc)
    ii  qemu-system-x86     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (x86)
    ii  qemu-utils          2.0.0+dfsg-2ubuntu1.11           amd64  QEMU utilities

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch
http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis <gior...@acmac.uoc.gr> wrote:

Jens-Christian,

how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?

In our installation we have a VM with 30 RBD volumes mounted, all exported via NFS to other VMs. No one has complained so far, but the load/usage is very minimal. If this problem really exists, then very soon, once the trial phase is over, we will have millions of complaints :-(

What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm.

Best regards,

George

I think we (i.e. Christian) found the problem: we created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120-second timeouts. We realized that the QEMU process on the hypervisor opens a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So it was no deep scrubbing etc., but simply too many connections…

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch [3]
http://www.switch.ch
http://www.switch.ch/stories
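The max_files knob Christian points to lives in libvirt's qemu.conf and raises the open-file limit applied to each QEMU process. A sketch with an illustrative value (the parameter name and file path come from Christian's message; the value and the sizing rule are assumptions):

    # /etc/libvirt/qemu.conf
    # Size this to roughly (volumes per VM) x (number of OSDs), plus headroom,
    # per the one-connection-per-OSD-per-volume behaviour described above.
    max_files = 32768

The new limit only applies to QEMU processes started after libvirtd picks up the change, so the guests need to be power-cycled; the effect can then be verified the same way Christian did:

    grep "Max open files" /proc/<qemu-pid>/limits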
On 25.05.2015, at 06:02, Christian Balzer wrote:

Hello,

let's compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or at the accumulation of several distributed delays).

You added 23 OSDs; tell us more about your cluster, HW, network. Were they added to the existing 16 nodes, or are they on new storage nodes (so could there be something different with those nodes)? How busy are your network and CPUs?

Running something like collectd to gather all Ceph perf data and other data from the storage nodes and then feeding it into graphite (or similar) can be VERY helpful in identifying whether something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify whether CPU, network, or specific HDDs/OSDs are bottlenecks.

Deep scrubbing can be _very_ taxing. Do your problems persist if you inject an osd_scrub_sleep value of 0.5 into your running cluster (lower
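The runtime injection Balzer mentions is typically done with Ceph's injectargs mechanism. A sketch, assuming an admin keyring on the node and a Ceph version where osd_scrub_sleep is honoured at runtime (the 0.5 value is from his suggestion; the command form is the standard mechanism, not a quote from this thread):

    # Throttle (deep) scrubbing cluster-wide without restarting OSDs.
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

    # To test the scrub hypothesis more bluntly, pause deep scrubbing entirely:
    ceph osd set nodeep-scrub
    # ...and re-enable it afterwards:
    ceph osd unset nodeep-scrub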