Hi Christian

This sounds like the same problem we are having: long wait times on the ceph nodes, with certain commands (in our case mainly mkfs) blocking for long periods, stuck in a wait state (not read or write). We see the same warning messages in syslog as well.

Jeff

On 04/21/2015 04:31 AM, Christian Eichelmann wrote:
Hi Dan,

nope, we have no iptables rules on those hosts and the gateway is on the
same subnet as the ceph cluster.

I will see if I can find some information on how to debug the rbd
kernel module (any suggestions are appreciated :))

Regards,
Christian

On 21.04.2015 at 10:20, Dan van der Ster wrote:
Hi Christian,

I've never debugged the kernel client either, so I don't know how to
increase debugging. (I don't see any useful parameters on the kernel
modules.)
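
One thing that might work (untested on my side, and assuming your
kernel was built with CONFIG_DYNAMIC_DEBUG and debugfs is available)
is the dynamic debug facility:

    mount -t debugfs none /sys/kernel/debug    # if not already mounted
    echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

The extra messages should then show up in dmesg / the kernel log, and
'-p' instead of '+p' turns them off again.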

Your log looks like the client just stops communicating with the ceph
cluster. Is iptables getting in the way?

Cheers, Dan

On Tue, Apr 21, 2015 at 9:13 AM, Christian Eichelmann
<[email protected]> wrote:
Hi Dan,

we are already back on the kernel module since the same problems were
happening with fuse. I had no special ulimit settings for the
fuse process, so that could have been an issue there.

I have pasted the kernel messages from such incidents here:
http://pastebin.com/X5JRe1v3

I have never debugged the kernel client. Can you give me a short hint on
how to increase the debug level and where the logs are written?

Regards,
Christian

On 20.04.2015 at 15:50, Dan van der Ster wrote:
Hi,
This is similar to what you would observe if you hit the ulimit on
open files/sockets in a Ceph client, though that normally only affects
clients in user mode, not the kernel. What are the ulimits of your
rbd-fuse client? Also, you could increase the client logging debug
levels to see why the client is hanging. When the kernel rbd client
was hanging, was there anything printed to dmesg?
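
For example, something along these lines (the pidof lookup, the log
path and the [client] section are just assumptions about your setup):

    # limits of the running rbd-fuse process, in particular "Max open files"
    cat /proc/$(pidof rbd-fuse)/limits

and in ceph.conf on the gateway, something like:

    [client]
        debug rbd = 20
        debug ms = 1
        log file = /var/log/ceph/$name.$pid.log
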
Cheers, Dan

On Mon, Apr 20, 2015 at 9:29 AM, Christian Eichelmann
<[email protected]> wrote:
Hi Ceph-Users!

We currently have a problem and I am not sure whether its cause lies
in Ceph or somewhere else. First, some information about our Ceph setup:

* ceph version 0.87.1
* 5 MON
* 12 OSD servers with 60x2TB disks each
* 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian
Wheezy)

Our cluster is mainly used to store log files from numerous servers via
rsync and to make them available via rsync as well. For about two weeks
we have been seeing very strange behaviour on our rsync gateways (they
just map several rbd devices and "export" them via rsyncd): the IO wait
on the systems keeps increasing until some of the cores get stuck at
100% IO wait. Rsync processes become zombies (defunct) and/or cannot be
killed even with SIGKILL. After the system reaches a load of about
1400, it becomes totally unresponsive and the only way to "fix" the
problem is to reboot it.
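
If it helps with debugging, the stuck processes can be inspected
roughly like this (<pid> is a placeholder; sysrq needs to be enabled):

    # processes in uninterruptible sleep (D state) and what they wait on
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

    # kernel stack of one stuck rsync process (as root)
    cat /proc/<pid>/stack

    # or dump all blocked tasks to the kernel log
    echo w > /proc/sysrq-trigger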

I tried to reproduce the problem manually by simultaneously reading
and writing from several machines, but the problem didn't appear.

I have no idea where the error could be. I ran a ceph tell osd.*
bench during the problem and all OSDs showed normal benchmark
results. Does anyone have an idea how this can happen? If you need any
more information, please let me know.
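
For reference, the exact command was:

    ceph tell osd.* bench

If it would help, I can also capture "ceph health detail" and
"ceph osd perf" output the next time this happens.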

Regards,
Christian


--
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
[email protected]

Amtsgericht Montabaur / HRB 6484
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren


