Re: [ceph-users] VMs freeze after slow requests

2013-06-04 Thread Dominik Mostowiec
Hi,
The VM died, but on its root disk I found this:
kern.log:
51 2013-06-04T21:18:02.568823+02:00 vm-1 kernel - - - [  220.717935]
sd 2:0:0:0: Attached scsi generic sg0 type 0
51 2013-06-04T21:18:02.568848+02:00 vm-1 kernel - - - [  220.718231]
sd 2:0:0:0: [sda] 1048576000 512-byte logical blocks: (536 GB/500 GiB)
51 2013-06-04T21:18:02.568848+02:00 vm-1 kernel - - - [  220.718644]
sd 2:0:0:0: [sda] Write Protect is off
71 2013-06-04T21:18:02.568848+02:00 vm-1 kernel - - - [  220.718648]
sd 2:0:0:0: [sda] Mode Sense: 63 00 00 08
51 2013-06-04T21:18:02.568848+02:00 vm-1 kernel - - - [  220.718831]
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
61 2013-06-04T21:18:02.572829+02:00 vm-1 kernel - - - [  220.720405]
 sda: unknown partition table
51 2013-06-04T21:18:02.572850+02:00 vm-1 kernel - - - [  220.721593]
sd 2:0:0:0: [sda] Attached SCSI disk
51 2013-06-04T21:18:23.492939+02:00 vm-1 kernel - - - [  241.642855]
XFS (sda): Mounting Filesystem
61 2013-06-04T21:18:23.540894+02:00 vm-1 kernel - - - [  241.688141]
XFS (sda): Ending clean mount
41 2013-06-04T21:19:51.270529+02:00 vm-1 kernel - - - [  329.413347]
hrtimer: interrupt took 8993506 ns
41 2013-06-04T21:21:40.732930+02:00 vm-1 kernel - - - [  438.880340]
sd 2:0:0:0: [sda] ABORT operation started
41 2013-06-04T21:21:45.732920+02:00 vm-1 kernel - - - [  443.880107]
sd 2:0:0:0: ABORT operation timed-out.
41 2013-06-04T21:21:45.732981+02:00 vm-1 kernel - - - [  443.880117]
sd 2:0:0:0: [sda] ABORT operation started
41 2013-06-04T21:21:50.732890+02:00 vm-1 kernel - - - [  448.880642]
sd 2:0:0:0: ABORT operation timed-out.
41 2013-06-04T21:21:50.732956+02:00 vm-1 kernel - - - [  448.880655]
sd 2:0:0:0: [sda] ABORT operation started
41 2013-06-04T21:21:55.732930+02:00 vm-1 kernel - - - [  453.880202]
sd 2:0:0:0: ABORT operation timed-out.
41 2013-06-04T21:21:55.732992+02:00 vm-1 kernel - - - [  453.880212]
sd 2:0:0:0: [sda] ABORT operation started
41 2013-06-04T21:22:00.732916+02:00 vm-1 kernel - - - [  458.880280]
sd 2:0:0:0: ABORT operation timed-out.
41 2013-06-04T21:22:00.732979+02:00 vm-1 kernel - - - [  458.880291]
sd 2:0:0:0: [sda] ABORT operation started
41 2013-06-04T21:22:05.732910+02:00 vm-1 kernel - - - [  463.880200]
sd 2:0:0:0: ABORT operation timed-out.
41 2013-06-04T21:22:05.732975+02:00 vm-1 kernel - - - [  463.880211]
sd 2:0:0:0: [sda] ABORT operation started
41 2013-06-04T21:22:10.732928+02:00 vm-1 kernel - - - [  468.881404]
sd 2:0:0:0: ABORT operation timed-out.
41 2013-06-04T21:22:10.732989+02:00 vm-1 kernel - - - [  468.881414]
sd 2:0:0:0: [sda] ABORT operation started
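
The repeating "ABORT operation started" / "ABORT operation timed-out" pairs are
the guest's SCSI layer trying to cancel I/O that the rbd backend never
completes, so they look like a symptom of the blocked requests rather than a
separate disk failure. A minimal check from inside the guest, assuming sda is
the rbd-backed data disk and the stock 30-second SCSI command timeout:

  cat /sys/block/sda/device/timeout         # default is usually 30 (seconds)
  echo 300 > /sys/block/sda/device/timeout  # only buys time while the cluster
                                            # is slow; it does not fix the
                                            # blocked osd request itself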

ceph -w
   health HEALTH_ERR 2 pgs inconsistent; 6 pgs peering; 1 pgs repair;
6 pgs stuck inactive; 11 pgs stuck unclean; 3 scrub errors
   monmap e9: 5 mons at
{0=10.177.67.4:6782/0,1=10.177.67.5:6782/0,3=10.177.67.7:6782/0,4=10.177.67.8:6782/0,5=10.177.67.9:6782/0},
election epoch 2612, quorum 0,1,2,3,4 0,1,3,4,5
   osdmap e12006: 156 osds: 156 up, 156 in
pgmap v1120920: 18306 pgs: 5 active, 18293 active+clean, 6
peering, 1 active+clean+inconsistent, 1
active+clean+scrubbing+deep+inconsistent+repair; 1044 GB data, 4773 GB
used, 38647 GB / 43420 GB avail
   mdsmap e1: 0/0/1 up

2013-06-04 21:22:58.459901 mon.0 [INF] pgmap v1120919: 18306 pgs: 5
active, 18293 active+clean, 6 peering, 2 active+clean+inconsistent;
1044 GB data, 4773 GB used, 38647 GB / 43420 GB avail
2013-06-04 21:22:59.483844 mon.0 [INF] pgmap v1120920: 18306 pgs: 5
active, 18293 active+clean, 6 peering, 1 active+clean+inconsistent, 1
active+clean+scrubbing+deep+inconsistent+repair; 1044 GB data, 4773 GB
used, 38647 GB / 43420 GB avail
2013-06-04 21:22:54.835243 osd.91 [WRN] 5 slow requests, 1 included
below; oldest blocked for  4510.528973 secs
2013-06-04 21:22:54.835256 osd.91 [WRN] slow request 4510.528973
seconds old, received at 2013-06-04 20:07:44.306200:
osd_op(client.12947699.0:7466 rb.0.c5895a.238e1f29.1d24
[delete] 3.695f3c2a e12006) v4 currently reached pg
2013-06-04 21:22:55.835495 osd.91 [WRN] 5 slow requests, 1 included
below; oldest blocked for  4511.529224 secs
2013-06-04 21:22:55.835500 osd.91 [WRN] slow request 4511.529224
seconds old, received at 2013-06-04 20:07:44.306200:
osd_op(client.12947699.0:7466 rb.0.c5895a.238e1f29.1d24
[delete] 3.695f3c2a e12006) v4 currently reached pg
2013-06-04 21:22:56.835712 osd.91 [WRN] 5 slow requests, 1 included
below; oldest blocked for  4512.529440 secs
2013-06-04 21:22:56.835717 osd.91 [WRN] slow request 4512.529440
seconds old, received at 2013-06-04 20:07:44.306200:
osd_op(client.12947699.0:7466 rb.0.c5895a.238e1f29.1d24
[delete] 3.695f3c2a e12006) v4 currently reached pg
2013-06-04 21:22:57.835956 osd.91 [WRN] 5 slow requests, 1 included
below; oldest blocked for  4513.529679 secs
2013-06-04 21:22:57.835961 osd.91 [WRN] slow request 4513.529679
seconds old, received at 2013-06-04 20:07:44.306200:
osd_op(client.12947699.0:7466 rb.0.c5895a.238e1f29.1d24
[delete] 3.695f3c2a e12006) v4 
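
The blocked request has been stuck at "currently reached pg" on osd.91 for over
75 minutes. A sketch of how to see what that osd is waiting on, assuming the
default admin socket path (dump_ops_in_flight should be available in
cuttlefish):

  ceph health detail             # which pgs are peering/inconsistent and which osds report slow requests
  ceph --admin-daemon /var/run/ceph/ceph-osd.91.asok dump_ops_in_flight
  ceph pg dump_stuck inactive    # the 6 pgs stuck in peering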

[ceph-users] VMs freeze after slow requests

2013-06-03 Thread Dominik Mostowiec
Hi,
I am trying to run a postgres cluster on VMs whose second disk is mounted from
ceph (rbd - kvm).
I started some writes (pgbench initialisation) on 8 VMs and the VMs froze.
Ceph reported slow requests on 1 osd. I restarted that osd to clear the
slow requests, and the VMs hung permanently.
Is this a normal situation after cluster problems?

Setup:
6 hosts x 26 osd
ceph version 0.61.2
kvm 1.2 ( librbd version 0.61.2 )
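
For completeness, the second disk is attached through librbd; the qemu drive
option presumably looks roughly like the line below (pool/image name, cache
mode and interface here are placeholders, not the exact production settings):

  qemu-system-x86_64 ... \
    -drive format=raw,file=rbd:rbd/pg-data:id=admin:conf=/etc/ceph/ceph.conf,cache=writeback,if=scsi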

--
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMs freeze after slow requests

2013-06-03 Thread Gregory Farnum
On Sunday, June 2, 2013, Dominik Mostowiec wrote:

 Hi,
 I am trying to run a postgres cluster on VMs whose second disk is mounted
 from ceph (rbd - kvm).
 I started some writes (pgbench initialisation) on 8 VMs and the VMs froze.
 Ceph reported slow requests on 1 osd. I restarted that osd to clear the
 slow requests, and the VMs hung permanently.
 Is this a normal situation after cluster problems?


Definitely not. Is your cluster reporting as healthy (what's ceph -s
say)? Can you get anything off your hung VMs (like dmesg output)?
-Greg


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] VMs freeze after slow requests

2013-06-03 Thread Olivier Bonvalet

On Monday, 3 June 2013 at 08:04 -0700, Gregory Farnum wrote:
 On Sunday, June 2, 2013, Dominik Mostowiec wrote:
 Hi,
 I am trying to run a postgres cluster on VMs whose second disk is
 mounted from ceph (rbd - kvm).
 I started some writes (pgbench initialisation) on 8 VMs and
 the VMs froze.
 Ceph reported slow requests on 1 osd. I restarted that osd to
 clear the slow requests, and the VMs hung permanently.
 Is this a normal situation after cluster problems?
 
 
 Definitely not. Is your cluster reporting as healthy (what's ceph -s
 say)? Can you get anything off your hung VMs (like dmesg output)?
 -Greg
 
 
 -- 
 Software Engineer #42 @ http://inktank.com | http://ceph.com

Hi,

I have also seen this with Xen and the kernel RBD client, when the ceph
cluster was full: after some errors the block device switched to
read-only mode, and I didn't find any way to fix that (mount -o
remount,rw doesn't work). I had to reboot all the VMs.

But since I didn't have to unmap/remap the RBD device, I don't think
it's a Ceph/RBD problem. It is probably Xen or Linux behaviour.
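
For what it's worth, whether the kernel really flagged the block device
read-only (as opposed to the filesystem shutting itself down) can be checked
from inside the guest; the device and mount point names below are only
examples:

  blockdev --getro /dev/xvdb     # 1 = the block device itself is read-only
  cat /sys/block/xvdb/ro
  # if the device is flagged read-only, clearing the flag before remounting
  # might help, but I have not verified this:
  blockdev --setrw /dev/xvdb
  mount -o remount,rw /mnt/data  # /mnt/data is a placeholder mount point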

Olivier




