All,

I've tried to recreate the issue, without success!

My configuration is the following:

OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x 4TB OSDs equally distributed across two disk nodes, 3x monitors


OpenStack Cinder has been configured to provide RBD Volumes from Ceph.

I have created 10x 500GB volumes, which were then all attached to a single virtual machine.

All volumes were formatted twice for comparison, once using "mkfs.xfs" and once using "mkfs.ext4". I tried to issue the commands all at the same time (or as close to that as possible).
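
Something like the following would fire them all off truly in parallel (just a sketch; it assumes the volumes appear as /dev/vdb through /dev/vdk inside the guest):

 for dev in /dev/vd{b..k}; do mkfs.xfs -f "$dev" & done; wait     # XFS pass
 for dev in /dev/vd{b..k}; do mkfs.ext4 -F "$dev" & done; wait    # ext4 pass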

In both tests I didn't notice any interruption. It may have taken longer than doing one volume at a time, but the system was continuously up and everything kept responding without problems.

While these processes were running, there were 100 open connections to one of the OSD nodes and 111 to the other.
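
In case anyone wants to check the same on their own hypervisor, something along these lines does the counting (OSD_NODE_IP is a placeholder for a disk node's address):

 # established TCP connections from the qemu-kvm process to one OSD node
 netstat -tnp 2>/dev/null | grep qemu-kvm | grep OSD_NODE_IP | wc -l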

So I guess I am not hitting the issue because of the low number of OSDs I have. Is my assumption correct?


Best regards,

George



Thanks a million for the feedback, Christian!

I've tried to recreate the issue with 10 RBD volumes mounted on a
single server, without success!

I've issued the "mkfs.xfs" command simultaneously (or at least as
fast as I could in different terminals) without noticing any
problems. Can you please tell me the size of each of your RBD
volumes? I have a feeling that mine were too small, and if so I
will have to test it on our bigger cluster.

I've also been thinking that, besides the QEMU version, the
underlying OS might also matter, so what was your testbed?


All the best,

George

Hi George

In order to experience the error it was enough to simply run mkfs.xfs
on all the volumes.


In the meantime it became clear what the problem was:

 ~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
...

This can be changed by setting a sensible value for max_files in
/etc/libvirt/qemu.conf.
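
For example (the value is only an illustration; size it to your own
number of volumes times OSDs, and restart libvirtd plus the guests
for it to take effect):

 # /etc/libvirt/qemu.conf
 max_files = 32768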

Regards
Christian



On 27 May 2015, at 16:23, Jens-Christian Fischer
<jens-christian.fisc...@switch.ch> wrote:

George,

I will let Christian provide you with the details. As far as I know, it was enough to just do an ‘ls’ on all of the attached drives.

We are using QEMU 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu           1.0.0+git-20131111.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
ii  qemu-keymaps        2.0.0+dfsg-2ubuntu1.11               all    QEMU keyboard maps
ii  qemu-system         2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries
ii  qemu-system-arm     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (common files)
ii  qemu-system-mips    2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (mips)
ii  qemu-system-misc    2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (sparc)
ii  qemu-system-x86     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (x86)
ii  qemu-utils          2.0.0+dfsg-2ubuntu1.11               amd64  QEMU utilities

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis <gior...@acmac.uoc.gr> wrote:

Jens-Christian,

How did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?

In our installation we have a VM with 30 RBD volumes mounted, which are all exported via NFS to other VMs. No one has complained so far, but the load/usage is very minimal. If this problem really exists, then very soon, once the trial phase is over, we will have millions of complaints :-(

What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm

Best regards,

George

I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all the disks, we started to experience these 120-second
timeouts. We realized that the QEMU process on the hypervisor opens
a TCP connection to every OSD for every mounted volume, exceeding
the 1024 FD limit.

So no deep scrubbing etc., but simply too many connections…
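
The back-of-the-envelope math fits: with one connection per OSD per
mounted volume, 9 volumes x 125 OSDs is roughly 1125 FDs for RBD
alone, well past the default soft limit of 1024. A quick way to see
it on the hypervisor (the pgrep pattern is only an example, match
your own QEMU process name):

 # count the open file descriptors of the guest's QEMU process
 ls /proc/$(pgrep -f qemu-system-x86 | head -n1)/fd | wc -l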

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch [3]
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer  wrote:

Hello,

Let's compare your case with John-Paul's.

Different OS and Ceph versions (thus we can assume different NFS versions as well).
The only thing in common is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or at the accumulation of several distributed delays).

You added 23 OSDs; tell us more about your cluster, HW, network.
Were these added to the existing 16 nodes, or are they on new storage nodes (so could there be something different with those nodes)? How busy are your network and CPU?
Running something like collectd to gather all Ceph perf data and other data from the storage nodes, and then feeding it to graphite (or similar), can be VERY helpful to identify whether something is going wrong and what it is in particular.
Otherwise run atop on your storage nodes to identify whether CPU, network, or specific HDDs/OSDs are bottlenecks.

Deep scrubbing can be _very_ taxing. Do your problems persist if you inject an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that until it hurts again), or if you turn off deep scrubs altogether for the moment?
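
E.g. something like this (from memory, double-check the option name against your version):

 ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'   # throttle scrubbing on the fly
 ceph osd set nodeep-scrub                            # pause deep scrubs for now
 # ("ceph osd unset nodeep-scrub" re-enables them later)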

Christian

On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:

We see something very similar on our Ceph cluster, starting as of today.

We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.).

On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that had become unwieldy). All of the servers (hypervisors, Ceph storage nodes and VMs) are using Ubuntu 14.04.

Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours). The cluster has been clean since then:

    cluster b1f3f4c8-xxxxx
     health HEALTH_OK
     monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0},
            election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
     osdmap e43476: 125 osds: 125 up, 125 in
      pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
            266 TB used, 187 TB / 454 TB avail
                3319 active+clean
                  17 active+clean+scrubbing+deep
      client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s

At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web servers was > 200 and they had lost some of the NFS mounts).
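
(For context, the nightly job boils down to a loop like the following; the pool name and snapshot naming here are placeholders, not our actual script:)

 # snapshot every image in the pool holding the NFS volumes
 for img in $(rbd -p volumes ls); do
     rbd snap create volumes/${img}@nightly-$(date +%Y%m%d)
 done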

Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run for between 10 and 90 minutes.

We initially suspected a corrupt RBD volume (it seemed that we could trigger the kernel crash just by running “ls -l” on one of the volumes), but subsequent “xfs_repair -n” checks on those RBD volumes showed no problems.

We migrated the NFS server off of its hypervisor, suspecting a problem with the RBD kernel modules, and rebooted the hypervisor, but the problem persisted (both on the new hypervisor and on the old one when we migrated it back).

We changed /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year).
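
(Concretely, that is the RPCNFSDCOUNT knob; if I remember correctly the stock default is 8:)

 # /etc/default/nfs-kernel-server
 RPCNFSDCOUNT=256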

Only one of our 3 NFS servers crashes (see below for syslog information); the other 2 have been fine.

May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
May 23 21:44:28 drive-nfs1 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 23 21:44:28 drive-nfs1 kernel: [  183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init: plymouth-upstart-bridge main process ended, respawning
May 23 21:51:26 drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [  600.778090]       Not tainted 3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [  600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 21:51:26 drive-nfs1 kernel: [  600.781504] nfsd            D ffff88013fd93180     0  1696      2 0x00000000
May 23 21:51:26 drive-nfs1 kernel: [  600.781508]  ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [  600.781511]  0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
May 23 21:51:26 drive-nfs1 kernel: [  600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
May 23 21:51:26 drive-nfs1 kernel: [  600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [  600.781523]  [] schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [  600.781526]  [] __mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [  600.781528]  [] mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [  600.781557]  [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781568]  [] ? fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781591]  [] nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781613]  [] nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781628]  [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781638]  [] nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781662]  [] svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [  600.781678]  [] svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [  600.781687]  [] nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781696]  [] ? nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781702]  [] kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [  600.781707]  [] ? kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [  600.781712]  [] ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [  600.781717]  [] ? kthread_create_on_node+0x1c0/0x1c0

Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100%; there is no pattern as to which of the RBD disks starts to act up.

We have scoured the log files of the Ceph cluster for any signs of problems but came up empty.

The NFS server has almost no load (compared to regular usage), as most sync clients are either turned off (weekend) or have given up connecting to the server.

There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of 23 OSDs.

We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3).

Our team is completely out of ideas. We have removed the 100TB volume from the NFS server (we used the downtime to migrate the last data off of it to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load), but we don’t really expect it to make it until tomorrow.

send help
Jens-Christian

--
Christian Balzer Network/Systems Engineer
ch...@gol.com [1] Global OnLine Japan/Fusion Communications
http://www.gol.com/ [2]



Links:
------
[1] mailto:ch...@gol.com
[2] http://www.gol.com/
[3] mailto:jens-christian.fisc...@switch.ch
[4] mailto:ch...@gol.com

--

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
