All, I've tried to recreate the issue, without success!
My configuration is the following:
OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x 4TB OSDs equally distributed over two disk nodes, 3x monitors
OpenStack Cinder has been configured to provide RBD volumes from Ceph. I have created 10x 500GB volumes, which were then all attached to a single virtual machine.
All volumes were formatted twice for comparison, once with "mkfs.xfs" and once with "mkfs.ext4". I tried to issue the commands all at the same time (or as close to that as possible).
In both tests I didn't notice any interruption. It may have taken longer than formatting one volume at a time, but the system stayed up continuously and everything kept responding without problems.
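For illustration, a minimal sketch of how such a simultaneous format can be kicked off from inside the guest; the /dev/vdb - /dev/vdk device names are only an assumption about how the ten attached volumes show up, not taken from the setup above:

    # format all ten attached volumes in parallel; use mkfs.ext4 instead
    # of mkfs.xfs for the second comparison run
    for dev in /dev/vd{b..k}; do
        mkfs.xfs -f "$dev" &
    done
    wait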
At the time of these operations there were 100 open connections to one of the OSD nodes and 111 to the other.
So I guess I am not experiencing the issue because of the low number of OSDs I have. Is my assumption correct?
Best regards, George
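For reference, one way to reproduce the per-OSD-node connection counts mentioned above from the hypervisor; <qemu-pid> and <osd-node-ip> are placeholders, not values from this setup:

    # established TCP connections of the guest's qemu-kvm process to one OSD node
    netstat -tnp | grep "<qemu-pid>/qemu" | grep -c "<osd-node-ip>"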
Thanks a million for the feedback Christian!

I've tried to recreate the issue with 10 RBD Volumes mounted on a single server without success! I've issued the "mkfs.xfs" command simultaneously (or at least as fast as I could do it in different terminals) without noticing any problems. Can you please tell me what the size of each one of the RBD Volumes was, because I have a feeling that mine were too small, and if so I have to test it on our bigger cluster. I've also thought that besides the QEMU version the underlying OS might also be important, so what was your testbed?

All the best,
George


Hi George,

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was:

    ~ ; cat /proc/183016/limits
    ...
    Max open files        1024        4096        files
    ...

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files.

Regards
Christian


On 27 May 2015, at 16:23, Jens-Christian Fischer <jens-christian.fisc...@switch.ch> wrote:

George,

I will let Christian provide you the details. As far as I know, it was enough to just do an 'ls' on all of the attached drives.

We are using Qemu 2.0:

    $ dpkg -l | grep qemu
    ii  ipxe-qemu           1.0.0+git-20131111.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
    ii  qemu-keymaps        2.0.0+dfsg-2ubuntu1.11               all    QEMU keyboard maps
    ii  qemu-system         2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries
    ii  qemu-system-arm     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (arm)
    ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (common files)
    ii  qemu-system-mips    2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (mips)
    ii  qemu-system-misc    2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (miscellaneous)
    ii  qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (ppc)
    ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (sparc)
    ii  qemu-system-x86     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (x86)
    ii  qemu-utils          2.0.0+dfsg-2ubuntu1.11               amd64  QEMU utilities

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch
http://www.switch.ch/stories


On 26.05.2015, at 19:12, Georgios Dimitrakakis <gior...@acmac.uoc.gr> wrote:

Jens-Christian,

how did you test that? Did you just try to write to them simultaneously? Any other tests that one can perform to verify that?

In our installation we have a VM with 30 RBD volumes mounted, which are all exported via NFS to other VMs. No one has complained for the moment, but the load/usage is very minimal. If this problem really exists, then as soon as the trial phase is over we will have millions of complaints :-(

What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm

Best regards,
George


I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc., but simply too many connections…

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch
http://www.switch.ch/stories
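For reference, a minimal sketch of how the descriptor usage and the libvirt limit described above can be checked and raised on the hypervisor; the max_files value is only an example, and the new limit only applies to qemu processes started after libvirtd has been restarted:

    # count open file descriptors of the qemu process (PID 183016 as in the
    # limits output above) and compare against its soft limit
    ls /proc/183016/fd | wc -l
    grep "open files" /proc/183016/limits

    # in /etc/libvirt/qemu.conf, raise the per-process file limit for
    # libvirt-managed guests (example value; aim well above
    # number_of_attached_volumes x number_of_OSDs):
    #     max_files = 32768
    # then restart libvirtd and restart the affected guests so the new
    # limit is picked up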
On 25.05.2015, at 06:02, Christian Balzer wrote:

Hello,

let's compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for more than 30 seconds, NFS might take offense to lower values (or the accumulation of several distributed delays).

You added 23 OSDs, tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, or are these on new storage nodes (so could there be something different with those nodes?), how busy is your network, CPU?

Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify if CPU, network, or specific HDDs/OSDs are bottlenecks.

Deep scrubbing can be _very_ taxing, do your problems persist if you inject into your running cluster an "osd_scrub_sleep" value of "0.5" (lower that until it hurts again) or if you turn off deep scrubs altogether for the moment?

Christian

On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:

We see something very similar on our Ceph cluster, starting as of today. We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.). On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy). All of the servers (hypervisors, ceph storage nodes and VMs) are using Ubuntu 14.04.

Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours). The cluster has been clean since then:

        cluster b1f3f4c8-xxxxx
         health HEALTH_OK
         monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
         osdmap e43476: 125 osds: 125 up, 125 in
         pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
               266 TB used, 187 TB / 454 TB avail
                   3319 active+clean
                     17 active+clean+scrubbing+deep
        client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s

At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web servers was > 200 and they had lost some of the NFS mounts). Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run between 10 and 90 minutes.

We initially suspected a corrupt RBD volume (as it seemed that we could trigger the kernel crash by just "ls -l" one of the volumes), but subsequent "xfs_repair -n" checks on those RBD volumes showed no problems.
We migrated the NFS server off of its hypervisor, suspecting a problem with the RBD kernel modules, and rebooted the hypervisor, but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back). We changed /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year).

Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine.

May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
May 23 21:44:28 drive-nfs1 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 23 21:44:28 drive-nfs1 kernel: [  183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init: plymouth-upstart-bridge main process ended, respawning
May 23 21:51:26 drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [  600.778090]       Not tainted 3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [  600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 21:51:26 drive-nfs1 kernel: [  600.781504] nfsd  D ffff88013fd93180  0  1696  2 0x00000000
May 23 21:51:26 drive-nfs1 kernel: [  600.781508]  ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [  600.781511]  0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
May 23 21:51:26 drive-nfs1 kernel: [  600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
May 23 21:51:26 drive-nfs1 kernel: [  600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [  600.781523]  [] schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [  600.781526]  [] __mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [  600.781528]  [] mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [  600.781557]  [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781568]  [] ? fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781591]  [] nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781613]  [] nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781628]  [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781638]  [] nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781662]  [] svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [  600.781678]  [] svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [  600.781687]  [] nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781696]  [] ? nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781702]  [] kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [  600.781707]  [] ? kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [  600.781712]  [] ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [  600.781717]  [] ? kthread_create_on_node+0x1c0/0x1c0

Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100% - there is no pattern as to which of the RBD disks starts to act up. We have scoured the log files of the Ceph cluster for any signs of problems but came up empty.

The NFS server has almost no load (compared to regular usage), as most sync clients are either turned off (weekend) or have given up connecting to the server. There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of 23 OSDs. We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3).

Our team is completely out of ideas. We have removed the 100TB volume from the NFS server (we used the downtime to migrate the last data off of it to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load), but we don't really expect it to make it until tomorrow.

send help
Jens-Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
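As an aside, a minimal sketch of the scrub throttling Christian Balzer suggests above, using standard Firefly-era commands; the 0.5 value is the one from his mail and is meant to be tuned:

    # throttle scrubbing on the running cluster
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

    # or temporarily disable deep scrubs altogether
    ceph osd set nodeep-scrub
    # (re-enable later with: ceph osd unset nodeep-scrub)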