[ceph-users] VMs freeze after slow requests

2013-06-03 Thread Dominik Mostowiec
Hi,
I am trying to start a postgres cluster on VMs whose second disk is
mounted from ceph (rbd via kvm).
I started some writes (pgbench initialisation) on 8 VMs and the VMs froze.
Ceph reported slow requests on 1 osd. I restarted that osd to clear the
slow requests, and now the VMs hang permanently.
Is this a normal situation after cluster problems?

Setup:
6 hosts x 26 osd
ceph version 0.61.2
kvm 1.2 ( librbd version 0.61.2 )

--
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMs freeze after slow requests

2013-06-03 Thread Gregory Farnum
On Sunday, June 2, 2013, Dominik Mostowiec wrote:

 Hi,
 I am trying to start a postgres cluster on VMs whose second disk is
 mounted from ceph (rbd via kvm).
 I started some writes (pgbench initialisation) on 8 VMs and the VMs froze.
 Ceph reported slow requests on 1 osd. I restarted that osd to clear the
 slow requests, and now the VMs hang permanently.
 Is this a normal situation after cluster problems?


Definitely not. Is your cluster reporting as healthy (what's ceph -s
say)? Can you get anything off your hung VMs (like dmesg output)?
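
Something along these lines is usually enough to capture that state
(hostnames and paths here are only illustrative):

ceph -s
ceph health detail
# inside one of the hung VMs, if the console still responds:
dmesg | tail -n 50
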
-Greg


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMs freeze after slow requests

2013-06-03 Thread Olivier Bonvalet

On Monday, 3 June 2013 at 08:04 -0700, Gregory Farnum wrote:
 On Sunday, June 2, 2013, Dominik Mostowiec wrote:
 Hi,
 I am trying to start a postgres cluster on VMs whose second
 disk is mounted from
 ceph (rbd via kvm).
 I started some writes (pgbench initialisation) on 8 VMs and
 the VMs froze.
 Ceph reported slow requests on 1 osd. I restarted that osd
 to clear the slow requests, and now the VMs hang permanently.
 Is this a normal situation after cluster problems?
 
 
 Definitely not. Is your cluster reporting as healthy (what's ceph -s
 say)? Can you get anything off your hung VMs (like dmesg output)?
 -Greg
 
 
 -- 
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi,

I have also seen this with Xen and the kernel RBD client, when the ceph
cluster was full: after some errors the block device switched to
read-only mode, and I didn't find any way to fix that (mount -o
remount,rw doesn't work). I had to reboot all the VMs.

But since I didn't have to unmap/remap the RBD device, I don't think
it's a Ceph/RBD problem. It's probably Xen or Linux behaviour.
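
For what it's worth, a way to check whether the kernel really marked the
device read-only before rebooting (the device name is only an example for a
Xen guest, and this may well not help while the cluster is still full):

blockdev --getro /dev/xvdb      # prints 1 if the block layer flagged it read-only
blockdev --setrw /dev/xvdb      # clear the flag first
mount -o remount,rw /mnt/data   # then retry the remount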

Olivier





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph killed by OS because of OOM under high load

2013-06-03 Thread Gregory Farnum
On Mon, Jun 3, 2013 at 8:47 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
 Hi,
 As reported in my previous mail some weeks ago, we have been suffering from
 OSD crashes, OSD flapping, system reboots and so on, and all these stability
 issues have really stopped us from digging further into ceph characterization.
 The good news is that we seem to have found the cause; I explain our
 experiments below.

 Environment:
 We have 2 machines, one for the client and one for ceph,
 connected via 10GbE.
 The client machine is very powerful, with 64 cores and 256GB RAM.
 The ceph machine has 32 cores and 64GB RAM, but we limited the
 available RAM to 8GB via the grub configuration. It runs 12 OSDs on top of
 12x 5400 RPM 1TB disks, with 4x DCS 3700 SSDs as journals.
 Both client and ceph are v0.61.2.
 We run 12 rados bench instances on the client node as a stress
 workload against the ceph node, each instance with 256 concurrent ops.

 Experiments and results:
 1. default ceph + default client:  OK
 2. tuned ceph + default client:  FAIL, one osd killed by the OS
    due to OOM, with all swap space exhausted. (tuning: large queue ops / large
    queue bytes / no flusher / sync_flush = true)
 3. tuned ceph WITHOUT large queue bytes + default client:  OK
 4. tuned ceph WITHOUT large queue bytes + aggressive client:
    FAIL, one osd killed by OOM and one suicided because of the 150s op thread
    timeout. (aggressive client: objecter_inflight_ops and
    objecter_inflight_op_bytes both set to 10x the default)

 Conclusions.
 We would like to say:
 a. Under heavy load, some tuning will make ceph unstable,
    especially the queue-bytes-related settings (deduced from 1+2+3).
 b. Ceph doesn't do any control on the length of the OSD
    queue; this is a critical issue. With an aggressive client or a lot of
    concurrent clients, the osd queue will become too large to fit in memory,
    which results in the osd daemon being killed. (deduced from 3+4)
 c. Observing osd daemon memory usage shows that if
    I use killall rados to kill all the rados bench instances, the ceph-osd
    daemons do not free the allocated memory; they still retain very high
    memory usage (a freshly started ceph uses ~0.5GB, under load ~6GB, and
    after killing rados it stays at 5-6GB; restarting ceph resolves this).

You don't have enough RAM for your OSDs. We really recommend 1-2GB per
daemon; 600MB/daemon is dangerous. You might be able to make it work,
but you'll definitely need to change the queue lengths and things.
Speaking of which... yes, the OSDs do control their queue lengths, but
it's not dynamic tuning, and by default they will let clients stack up
500MB of in-progress writes. With such wimpy systems you'll want to
turn that down, probably alongside various journal and disk wait
queues.
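
As a sketch only (the exact option names are worth double-checking against
the release you are running), the 500MB client limit and the client-side
in-flight limits can be lowered roughly like this:

ceph tell osd.0 injectargs '--osd_client_message_size_cap 104857600'  # ~100MB instead of 500MB
# or persistently in ceph.conf under [osd]:
#   osd client message size cap = 104857600
# on the client side, objecter_inflight_ops and objecter_inflight_op_bytes
# are the corresponding knobs to keep at or below their defaults
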
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS has been repeatedly laggy or crashed

2013-06-03 Thread Gregory Farnum
On Sat, Jun 1, 2013 at 7:50 PM, MinhTien MinhTien
tientienminh080...@gmail.com wrote:
 Hi all.
 I have 3 servers (using ceph 0.56.6):
 1 server used for mon and mds.0
 1 server running an OSD daemon (RAID 6, 44TB, as OSD.0) and mds.1
 1 server running an OSD daemon (RAID 6, 44TB, as OSD.1) and mds.2

 When the ceph system is running, the MDSs have been repeatedly laggy or
 crashed, 2 times in 1 minute, and then the MDS reconnects and comes back
 active.

Do you have logs from the MDS that actually crashed? What you have
here doesn't really tell us anything, except that apparently the disk
state is okay (that's good).


 If I set max mds active = 1, this error is generated.
 If I set max mds active = 2, this error is generated.

Hmm, you don't really want to fiddle with this setting. Using multiple
active MDSes is much less stable than one. Why were you fiddling with
it to begin with?
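
If the extra MDSes were only meant as standbys, a sketch of dropping back to
a single active MDS (exact commands may differ slightly between releases, and
an already-active extra rank may also need to be stopped explicitly):

ceph mds set_max_mds 1
ceph mds stat   # should eventually show one up:active MDS plus standbys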
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy

2013-06-03 Thread John Wilkins
Actually, as I said, I unmounted them first, zapped the disk, then
used OSD create. For you, that might look like:

sudo umount /dev/sda3
ceph-deploy disk zap  ceph0:sda3 ceph1:sda3 ceph2:sda3
ceph-deploy osd create ceph0:sda3 ceph1:sda3 ceph2:sda3

I was referring to the entire disk in my deployment, but I wasn't
using partitions on the same disk. So ceph-deploy created the data and
journal partitions for me. If you are running multiple OSDs on the
same disk (not recommended, except for evaluation), you'd want to use
the following procedure:


On Sat, Jun 1, 2013 at 7:57 AM, Dewan Shamsul Alam
dewan.sham...@gmail.com wrote:
 Hi John,

 I have a feeling that I am missing something. Previously, when I succeeded
 with bobtail using mkcephfs, I mounted the /dev/sdb1 partitions. There is
 nothing mentioned in the blog about that, though.

 Say I have 3 nodes, ceph201, ceph202 and ceph203. Each has a /dev/sdb1
 partition formatted as xfs. Do I need to mount them in a particular
 directory prior to running the command, or will ceph-deploy take care of it?


 On Thu, May 30, 2013 at 8:17 PM, John Wilkins john.wilk...@inktank.com
 wrote:

 Dewan,

 I encountered this too. I just did umount and reran the command and it
 worked for me. I probably need to add a troubleshooting section for
 ceph-deploy.

 On Fri, May 24, 2013 at 4:00 PM, John Wilkins john.wilk...@inktank.com
 wrote:
  ceph-deploy does have the ability to push the client keyrings. I
  haven't encountered this as a problem. However, I have created a
  monitor and not seen it return a keyring. In other words, it failed
  but didn't give me a warning message. So I just re-executed creating
  the monitor. The directory from which you execute ceph-deploy mon
  create should have a ceph.client.admin.keyring too. If it doesn't,
  you might have had a problem creating the monitor. I don't believe you
  have to push the ceph.client.admin.keyring to all the nodes. So it
  shouldn't be barking back unless you failed to create the monitor or
  gatherkeys failed.
 
  On Thu, May 23, 2013 at 9:09 PM, Dewan Shamsul Alam
  dewan.sham...@gmail.com wrote:
  I just found that
 
  #ceph-deploy gatherkeys ceph0 ceph1 ceph2
 
   works only if I have bobtail. Cuttlefish can't find
   ceph.client.admin.keyring.
  
   And then when I try this on bobtail, it says:
 
  root@cephdeploy:~/12.04# ceph-deploy osd create ceph0:/dev/sda3
  ceph1:/dev/sda3 ceph2:/dev/sda3
  ceph-disk: Error: Device is mounted: /dev/sda3
   Traceback (most recent call last):
     File "/usr/bin/ceph-deploy", line 22, in <module>
       main()
     File "/usr/lib/pymodules/python2.7/ceph_deploy/cli.py", line 112, in main
       return args.func(args)
     File "/usr/lib/pymodules/python2.7/ceph_deploy/osd.py", line 293, in osd
       prepare(args, cfg, activate_prepared_disk=True)
     File "/usr/lib/pymodules/python2.7/ceph_deploy/osd.py", line 177, in prepare
       dmcrypt_dir=args.dmcrypt_key_dir,
     File "/usr/lib/python2.7/dist-packages/pushy/protocol/proxy.py", line 255, in <lambda>
       (conn.operator(type_, self, args, kwargs))
     File "/usr/lib/python2.7/dist-packages/pushy/protocol/connection.py", line 66, in operator
       return self.send_request(type_, (object, args, kwargs))
     File "/usr/lib/python2.7/dist-packages/pushy/protocol/baseconnection.py", line 323, in send_request
       return self.__handle(m)
     File "/usr/lib/python2.7/dist-packages/pushy/protocol/baseconnection.py", line 639, in __handle
       raise e
   pushy.protocol.proxy.ExceptionProxy: Command '['ceph-disk-prepare', '--', '/dev/sda3']' returned non-zero exit status 1
   root@cephdeploy:~/12.04#
 
 
 
 
  On Thu, May 23, 2013 at 10:49 PM, Dewan Shamsul Alam
  dewan.sham...@gmail.com wrote:
 
  Hi,
 
   I tried ceph-deploy all day. I found that it has python-setuptools as a
   dependency. I knew about python-pushy, but is there any other dependency
   that I'm missing?
  
   The problems I'm getting are as follows:
  
   #ceph-deploy gatherkeys ceph0 ceph1 ceph2
   returns the following error:
   Unable to find /etc/ceph/ceph.client.admin.keyring on ['ceph0', 'ceph1',
   'ceph2']
  
   Once I got past this, I don't know why it only works sometimes. I have
   been following the exact steps as mentioned in the blog.
  
   Then when I try to do
  
   ceph-deploy osd create ceph0:/dev/sda3 ceph1:/dev/sda3 ceph2:/dev/sda3
  
   it gets stuck.
  
   I'm using Ubuntu 13.04 for ceph-deploy and 12.04 for the ceph nodes. I
   just need to get cuttlefish working and am willing to change the OS if
   that is required. Please help. :)
 
  Best Regards,
  Dewan Shamsul Alam
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
  --
  John Wilkins
  Senior Technical Writer
  Inktank
  john.wilk...@inktank.com
  (415) 425-9599
  http://inktank.com



 --
 John Wilkins
 Senior Technical Writer
 Inktank
 john.wilk...@inktank.com
 (415) 425-9599
 

Re: [ceph-users] ceph-deploy

2013-06-03 Thread John Wilkins
Sorry... hit send inadvertently...

http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only


Re: [ceph-users] replacing an OSD or crush map sensitivity

2013-06-03 Thread Chen, Xiaoxi
My $0.02: you really don't need to wait for HEALTH_OK between your recovery
steps, just go ahead. Every time a new map is generated and broadcast, the
old map and the in-progress recovery will be cancelled.

Sent from my iPhone

On 2013-6-2, at 11:30, Nigel Williams nigel.d.willi...@gmail.com wrote:

 Could I have a critique of this approach, please, as to how I could have done
 it better, or whether what I experienced simply reflects work still to be done?
 
 This is with Ceph 0.61.2 on a quite slow test cluster (logs shared with OSDs, 
 no separate journals, using CephFS).
 
 I knocked the power cord out of a storage node, taking down 4 of the hosted
 OSDs; all but one came back OK. That is one OSD out of a total of 12, so 1/12
 of the storage.
 
 Losing an OSD put the cluster into recovery, so all good. Next action was how 
 to get the missing (downed) OSD back online.
 
 The OSD was xfs-based, so I had to throw away the xfs log to get it to
 mount. Having done this and re-mounted it, Ceph then started throwing
 issue #4855 (I added dmesg and logs to that issue if it helps - I wonder
 whether throwing away the xfs log caused an internal OSD inconsistency,
 which in turn causes issue #4855?). Given that, as far as Ceph is
 concerned, I could not recover this OSD, I decided to delete and rebuild it.
 
 Several hours later, the cluster was back to HEALTH_OK. I proceeded to remove
 and re-add the bad OSD, following the doc suggestions to do this.
 
 The problem is that each change caused a slight change in the crush map,
 resulting in the cluster going back into recovery and adding a several-hour
 wait per change. I chose to wait until the cluster was back to HEALTH_OK
 before doing the next step. Overall it has taken a few days to finally get a
 single OSD back into the cluster.
 
 At one point during recovery the full threshold was triggered on a single OSD,
 causing the recovery to stop; doing ceph pg set_full_ratio 0.98 did not
 help. I was not planning to add data to the cluster while doing recovery
 operations and did not understand the suggestion that PGs could be deleted to
 make space on a full OSD, so I expect raising the threshold was the best
 option, but it had no (immediate) effect.
 
 I am now back to having all 12 OSDs in, with the hopefully final recovery under
 way while it re-balances the OSDs. Although I note I am still getting the
 full-OSD warning, I am expecting it to disappear soon now that the 12th OSD
 is back online.
 
 During this recovery the percentage degraded has been a little confusing.
 While the 12th OSD was offline the percentages were around 15-20% IIRC, but
 now I see the percentage is 35% and slowly dropping. I am not sure I understand
 the ratios and why they are so high with a single missing OSD.
 
 A few documentation errors caused confusion too.
 
 This page still contains errors in the steps to create a new OSD (manually):
 
 http://eu.ceph.com/docs/wip-3060/cluster-ops/add-or-rm-osds/#adding-an-osd-manual
 
 ceph osd create {osd-num} should be ceph osd create
 
 
 and this:
 
 http://eu.ceph.com/docs/wip-3060/cluster-ops/crush-map/#addosd
 
 I had to put host= to get the command accepted.
 
 Suggestions and questions:
 
 1. Is there a way to get documentation pages fixed, or at least to put
 health warnings on them ("This page badly needs updating since it is
 wrong/misleading")?
 
 2. We need a small set of definitive succinct recipes that provide steps to 
 recover from common failures with a narrative around what to expect at each 
 step (your cluster will be in recovery here...).
 
 3. Some commands are throwing erroneous errors that are actually benign:
 ceph-osd -i 10 --mkfs --mkkey complains about failures that are expected,
 as the OSD is initially empty.
 
 4. An easier way to capture the state of the cluster for analysis. I don't
 feel confident that, when asked for logs, I am giving the most useful
 snippets or the complete story. It seems we need a tool that can gather all
 of this in a neat bundle for later dissection or forensics.
 
 5. Is there a more straightforward (faster) way of getting an OSD back online?
 It almost seems like it is worth having a standby OSD ready to step in and
 assume duties (a hot spare?).
 
 6. Is there a way to make the crush map less sensitive to changes during 
 recovery operations? I would have liked to stall/slow recovery while I 
 replaced the OSD then let it run at full speed.
 
 Excuses:
 
 I'd be happy to action suggestions, but my current level of Ceph understanding
 is still so limited that effort on my part is unproductive; I am prodding
 the community to see if there is consensus on the need.
 
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS + qemu-kvm rbd support update

2013-06-03 Thread YIP Wai Peng
Hi Andrei,

Have you tried the patched ones at
https://objects.dreamhost.com/rpms/qemu/qemu-kvm-0.12.1.2-2.355.el6.2.x86_64.rpm and
https://objects.dreamhost.com/rpms/qemu/qemu-img-0.12.1.2-2.355.el6.2.x86_64.rpm ?

I got the links off the IRC chat, I'm using them now.
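
After installing them, a quick sanity check that rbd support is actually
compiled in (the image name is just a throwaway example):

qemu-img --help | grep -i rbd                 # "rbd" should appear among the supported formats
qemu-img create -f rbd rbd:rbd/qemu-test 1G   # creates a 1G test image in the default rbd pool
rbd rm rbd/qemu-test                          # clean up the test image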

- WP


On Sun, Jun 2, 2013 at 8:41 AM, Andrei Mikhailovsky and...@arhont.com wrote:

 Hello guys,

 Was wondering if there is any news on the CentOS 6 qemu-kvm packages with
 rbd support? I am very keen to try them out.

 Thanks

 Andrei

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.2 rbd-fixed ubuntu packages

2013-06-03 Thread w sun
Thanks for your clarification. I don't have much in-depth knowledge of libvirt,
although I believe OpenStack does use it for scheduling nova compute jobs
(initiating VM instances) and supporting live migration, both of which work
properly in our grizzly environment. I will keep an eye on this and report back
if I see any similar issues.
--weiguo

 Date: Mon, 3 Jun 2013 08:10:51 +0200
 From: wolfgang.hennerbich...@risc-software.at
 To: ws...@hotmail.com
 CC: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] qemu-1.4.2 rbd-fixed ubuntu packages
 
 On Wed, May 29, 2013 at 04:16:14PM +0200, w sun wrote:
  Hi Wolfgang,
  
  Can you elaborate on the issue with 1.5 and libvirt? I wonder if that will
  impact usage with Grizzly. I did a quick compile of 1.5 with RBD support
  enabled, and so far it seems to be OK for OpenStack with a few simple tests.
  But I definitely want to be cautious if there is a known integration issue
  with 1.5.
  
  Thanks. --weiguo
 
 I basically couldn't make the vm boot with libvirt. Libvirt complained about 
 a missing monitor command (not ceph monitor, but kvm monitor file or 
 something). I didn't want to start upgrading libvirt too, so I stepped back 
 to 1.4.2. 
 
 Wolfgang
 
 
 -- 
 http://www.wogri.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread YIP Wai Peng
Hi all,

I'm running ceph on CentOS6 on 3 hosts, with 3 OSD each (total 9 OSD).

When I increased one of my pools' rep size from 2 to 3, just 6 PGs got stuck
in active+clean+degraded mode, but ceph doesn't create the new replicas.

One of the problematic PGs has the following (snipped for brevity):

{ "state": "active+clean+degraded",
  "epoch": 1329,
  "up": [
        4,
        6],
  "acting": [
        4,
        6],
<snip>
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-06-04 01:10:30.092977",
          "might_have_unfound": [
                { "osd": 3,
                  "status": "already probed"},
                { "osd": 5,
                  "status": "not queried"},
                { "osd": 6,
                  "status": "already probed"}],
<snip>


I tried force_create_pg but it gets stuck in "creating". Any ideas on how
to kickstart this PG so it creates the correct number of replicas?


PS: I have the following crush rule for the pool, which makes the replicas
go to different hosts.
host1 has OSD 0,1,2
host2 has OSD 3,4,5
host3 has OSD 6,7,8
Looking at it, the new replica should be going to OSD 0,1,2, but ceph is
not creating it?

rule different_host {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
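
One way to sanity-check that this rule can actually map three replicas to
three different hosts is to run it through crushtool offline (a rough sketch;
flags may vary slightly between versions):

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt   # inspect the host buckets by eye
crushtool -i /tmp/crushmap --test --rule 3 --num-rep 3 --show-mappings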


Any help will be much appreciated. Cheers
- Wai Peng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replacing an OSD or crush map sensitivity

2013-06-03 Thread Nigel Williams
On 4/06/2013 9:16 AM, Chen, Xiaoxi wrote:
 My $0.02: you really don't need to wait for HEALTH_OK between your
 recovery steps, just go ahead. Every time a new map is generated and
 broadcast, the old map and the in-progress recovery will be cancelled.

thanks Xiaoxi, that is helpful to know.

It seems to me that there might be a failure mode (or race condition?)
here though, as the cluster is now struggling to recover: the
replacement OSD caused the cluster to go into backfill_toofull.

The failure sequence might be:

1. From HEALTH_OK crash an OSD
2. Wait for recovery
3. Remove OSD using usual procedures
4. Wait for recovery
5. Add back OSD using usual procedures
6. Wait for recovery
7. Cluster is unable to recover due to toofull conditions

Perhaps this is a needed test case to round-trip a cluster through a
known failure/recovery scenario.
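
For reference, the remove/re-add in steps 3 and 5 roughly corresponds to the
sequence below (osd id 10 and the host name are placeholders, and the exact
crush add/set syntax differs a little between releases, so the add-or-rm-osds
documentation should be treated as the authority):

ceph osd out 10
ceph osd crush remove osd.10
ceph auth del osd.10
ceph osd rm 10
# re-add after rebuilding its filesystem
ceph osd create                  # returns the new osd id
ceph-osd -i 10 --mkfs --mkkey
ceph auth add osd.10 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-10/keyring
ceph osd crush add osd.10 1.0 root=default host={hostname}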

Note this is using a simplistically configured test-cluster with CephFS
in the mix and about 2.5 million files.

Something else I noticed: I restarted the cluster (and set the leveldb
compact option since I'd run out of space on the roots) and now I see it
is again making progress on the backfill. Seems odd that the cluster
pauses but a restart clears the pause, is that by design?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread Sage Weil
On Tue, 4 Jun 2013, YIP Wai Peng wrote:
 Hi all,
 I'm running ceph on CentOS6 on 3 hosts, with 3 OSD each (total 9 OSD).
 When I increased one of my pool rep size from 2 to 3, just 6 PGs will get
 stuck in active+clean+degraded mode, but it doesn't create new replicas.

My first guess is that you do not have the newer crush tunables set and 
some placements are not quite right.  If you are prepared for some data 
migration, and are not using an older kernel client, try

 ceph osd crush tunables optimal

sage
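
A quick way to see what the tunables currently are before flipping them (the
decompiled map typically only lists tunable lines when they differ from the
legacy values):

ceph osd getcrushmap -o /tmp/cm
crushtool -d /tmp/cm | grep ^tunable   # empty output usually means legacy tunables
ceph osd crush tunables optimal        # expect some data movement afterwards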


 
 One of the problematic PG has the following (snipped for brevity) 
 
 { state: active+clean+degraded,
   epoch: 1329,
   up: [
         4,
         6],
   acting: [
         4,
         6],
 snip
   recovery_state: [
         { name: Started\/Primary\/Active,
           enter_time: 2013-06-04 01:10:30.092977,
           might_have_unfound: [
                 { osd: 3,
                   status: already probed},
                 { osd: 5,
                   status: not queried},
                 { osd: 6,
                   status: already probed}],
 snip
 
 
 I tried force_create_pg but it gets stuck in creating. Any ideas on how to
 kickstart this node to create the correct numbers of replicas?
 
 
 PS: I have the following crush rule for the pool, which makes the replicas
 go to different hosts. 
 host1 has OSD 0,1,2
 host2 has OSD 3,4,5
 host3 has OSD 6,7,8
 Looking at it, the new replica should be going to OSD 0,1,2, but ceph is not
 creating it?
 
 rule different_host {
         ruleset 3
         type replicated
         min_size 1
         max_size 10
         step take default
         step chooseleaf firstn 0 type host
         step emit
 }
 
 
 Any help will be much appreciated. Cheers
 - Wai Peng
 
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replacing an OSD or crush map sensitivity

2013-06-03 Thread Sage Weil
On Tue, 4 Jun 2013, Nigel Williams wrote:
 On 4/06/2013 9:16 AM, Chen, Xiaoxi wrote:
  My $0.02: you really don't need to wait for HEALTH_OK between your
  recovery steps, just go ahead. Every time a new map is generated and
  broadcast, the old map and the in-progress recovery will be cancelled.
 
 thanks Xiaoxi, that is helpful to know.
 
 It seems to me that there might be a failure-mode (or race-condition?)
 here though, as the cluster is now struggling to recover as the
 replacement OSD caused the cluster to go into backfill_toofull.
 
 The failure sequence might be:
 
 1. From HEALTH_OK crash an OSD
 2. Wait for recovery
 3. Remove OSD using usual procedures
 4. Wait for recovery
 5. Add back OSD using usual procedures
 6. Wait for recovery
 7. Cluster is unable to recover due to toofull conditions
 
 Perhaps this is a needed test case to round-trip a cluster through a
 known failure/recovery scenario.
 
 Note this is using a simplistically configured test-cluster with CephFS
 in the mix and about 2.5 million files.
 
 Something else I noticed: I restarted the cluster (and set the leveldb
 compact option since I'd run out of space on the roots) and now I see it
 is again making progress on the backfill. Seems odd that the cluster
 pauses but a restart clears the pause, is that by design?

Does the monitor data directory share a disk with an OSD?  If so, that 
makes sense: compaction freed enough space to drop below the threshold...

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread YIP Wai Peng
Hi Sage,

It is on optimal tunables already. However, I'm on kernel
2.6.32-358.6.2.el6.x86_64. Will the tunables take effect or do I have to
upgrade to something newer?

- WP


On Tue, Jun 4, 2013 at 11:58 AM, Sage Weil s...@inktank.com wrote:

 On Tue, 4 Jun 2013, YIP Wai Peng wrote:
  Hi all,
  I'm running ceph on CentOS6 on 3 hosts, with 3 OSD each (total 9 OSD).
  When I increased one of my pool rep size from 2 to 3, just 6 PGs will get
  stuck in active+clean+degraded mode, but it doesn't create new replicas.

 My first guess is that you do not have the newer crush tunables set and
 some placements are not quite right.  If you are prepared for some data
 migration, and are not using an older kernel client, try

  ceph osd crush tunables optimal

 sage


 
  One of the problematic PG has the following (snipped for brevity)
 
  { state: active+clean+degraded,
epoch: 1329,
up: [
  4,
  6],
acting: [
  4,
  6],
  snip
recovery_state: [
  { name: Started\/Primary\/Active,
enter_time: 2013-06-04 01:10:30.092977,
might_have_unfound: [
  { osd: 3,
status: already probed},
  { osd: 5,
status: not queried},
  { osd: 6,
status: already probed}],
  snip
 
 
  I tried force_create_pg but it gets stuck in creating. Any ideas on
 how to
  kickstart this node to create the correct numbers of replicas?
 
 
  PS: I have the following crush rule for the pool, which makes the
 replicas
  go to different hosts.
  host1 has OSD 0,1,2
  host2 has OSD 3,4,5
  host3 has OSD 6,7,8
  Looking at it, the new replica should be going to OSD 0,1,2, but ceph is
 not
  creating it?
 
  rule different_host {
  ruleset 3
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
  }
 
 
  Any help will be much appreciated. Cheers
  - Wai Peng
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replacing an OSD or crush map sensitivity

2013-06-03 Thread Nigel Williams
On Tue, Jun 4, 2013 at 1:59 PM, Sage Weil s...@inktank.com wrote:
 On Tue, 4 Jun 2013, Nigel Williams wrote:
 Something else I noticed: ...

 Does the monitor data directory share a disk with an OSD?  If so, that
 makes sense: compaction freed enough space to drop below the threshold...

Of course! That is exactly it, thanks - scratch that last
observation, red herring.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread Wolfgang Hennerbichler
On Mon, Jun 03, 2013 at 08:58:00PM -0700, Sage Weil wrote:
 
 My first guess is that you do not have the newer crush tunables set and 
 some placements are not quite right.  If you are prepared for some data 
 migration, and are not using an older kernel client, try
 
  ceph osd crush tunables optimal

One thing that I'm not quite sure about - in the documentation we learn: "The
ceph-osd and ceph-mon daemons will start requiring the feature bits of new
connections as soon as they get the updated map. However, already-connected
clients are effectively grandfathered in, and will misbehave if they do not
support the new feature."

So: am I in danger if I set this to optimal in a production bobtail cluster
with qemu-rbd being the only client around?
 
 sage

Wolfgang


-- 
http://www.wogri.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread YIP Wai Peng
Sorry, to set things in context: I had some other problems last weekend, and
setting optimal tunables helped (although I am on the older kernel).
Since it worked, I was inclined to believe that the tunables do work on
the older kernel.

That being said, I will upgrade the kernel to see if this issue goes away.

Regards,
Wai Peng


On Tue, Jun 4, 2013 at 12:01 PM, YIP Wai Peng yi...@comp.nus.edu.sg wrote:

 Hi Sage,

 It is on optimal tunables already. However, I'm on kernel
 2.6.32-358.6.2.el6.x86_64. Will the tunables take effect or do I have to
 upgrade to something newer?

 - WP


 On Tue, Jun 4, 2013 at 11:58 AM, Sage Weil s...@inktank.com wrote:

 On Tue, 4 Jun 2013, YIP Wai Peng wrote:
  Hi all,
  I'm running ceph on CentOS6 on 3 hosts, with 3 OSD each (total 9 OSD).
  When I increased one of my pool rep size from 2 to 3, just 6 PGs will
 get
  stuck in active+clean+degraded mode, but it doesn't create new replicas.

 My first guess is that you do not have the newer crush tunables set and
 some placements are not quite right.  If you are prepared for some data
 migration, and are not using an older kernel client, try

  ceph osd crush tunables optimal

 sage


 
  One of the problematic PG has the following (snipped for brevity)
 
  { state: active+clean+degraded,
epoch: 1329,
up: [
  4,
  6],
acting: [
  4,
  6],
  snip
recovery_state: [
  { name: Started\/Primary\/Active,
enter_time: 2013-06-04 01:10:30.092977,
might_have_unfound: [
  { osd: 3,
status: already probed},
  { osd: 5,
status: not queried},
  { osd: 6,
status: already probed}],
  snip
 
 
  I tried force_create_pg but it gets stuck in creating. Any ideas on
 how to
  kickstart this node to create the correct numbers of replicas?
 
 
  PS: I have the following crush rule for the pool, which makes the
 replicas
  go to different hosts.
  host1 has OSD 0,1,2
  host2 has OSD 3,4,5
  host3 has OSD 6,7,8
  Looking at it, the new replica should be going to OSD 0,1,2, but ceph
 is not
  creating it?
 
  rule different_host {
  ruleset 3
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
  }
 
 
  Any help will be much appreciated. Cheers
  - Wai Peng
 
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread Sage Weil
On Tue, 4 Jun 2013, Wolfgang Hennerbichler wrote:
 On Mon, Jun 03, 2013 at 08:58:00PM -0700, Sage Weil wrote:
  
  My first guess is that you do not have the newer crush tunables set and 
  some placements are not quite right.  If you are prepared for some data 
  migration, and are not using an older kernel client, try
  
   ceph osd crush tunables optimal
 
  One thing that I'm not quite sure about - in the documentation we learn: The 
 ceph-osd and ceph-mon daemons will start requiring the feature bits of new 
 connections as soon as they get the updated map. However, already-connected 
 clients are effectively grandfathered in, and will misbehave if they do not 
 support the new feature.
 
 So: Am I in danger when I set this to optimal in a productive bobtail-cluster 
 with qemu-rbd being the only client around? 

The tunables were added in v0.55 (just prior to bobtail), so you should be 
in good shape.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread Sage Weil
On Tue, 4 Jun 2013, YIP Wai Peng wrote:
 Sorry, to set things in context, I had some other problems last weekend.
 Setting it to optimal tunables helped (although I am on the older kernel).
 Since it worked, I was inclined to believed that the tunables do work on the
 older kernel.
 That being said, I will upgrade the kernel to see if this issue goes away.

The kernel version is only an issue if you are using the cephfs or rbd 
*client* from the kernel (e.g., rbd map ... or mount -t ceph ...).  (Ceph 
didn't appear upstream until 2.6.35 or thereabouts, and fixes are only 
backported as far as v3.4.)
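
A quick way to check whether a given host is using the kernel clients at all
(if nothing is mapped or mounted, the kernel version doesn't matter here):

lsmod | grep -E '^(rbd|ceph)'   # are the kernel modules loaded?
rbd showmapped                  # any kernel-mapped rbd images?
mount -t ceph                   # any kernel cephfs mounts?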

sage

 
 Regards,
 Wai Peng
 
 
 On Tue, Jun 4, 2013 at 12:01 PM, YIP Wai Peng yi...@comp.nus.edu.sg wrote:
   Hi Sage,
 It is on optimal tunables already. However, I'm on kernel
 2.6.32-358.6.2.el6.x86_64. Will the tunables take effect or do I have
 to upgrade to something newer?
 
 - WP
 
 
 On Tue, Jun 4, 2013 at 11:58 AM, Sage Weil s...@inktank.com wrote:
   On Tue, 4 Jun 2013, YIP Wai Peng wrote:
Hi all,
I'm running ceph on CentOS6 on 3 hosts, with 3 OSD each
   (total 9 OSD).
When I increased one of my pool rep size from 2 to 3,
   just 6 PGs will get
stuck in active+clean+degraded mode, but it doesn't
   create new replicas.
 
 My first guess is that you do not have the newer crush tunables
 set and
 some placements are not quite right.  If you are prepared for
 some data
 migration, and are not using an older kernel client, try
 
  ceph osd crush tunables optimal
 
 sage
 
 
 
  One of the problematic PG has the following (snipped for
 brevity) 
 
  { state: active+clean+degraded,
    epoch: 1329,
    up: [
          4,
          6],
    acting: [
          4,
          6],
  snip
    recovery_state: [
          { name: Started\/Primary\/Active,
            enter_time: 2013-06-04 01:10:30.092977,
            might_have_unfound: [
                  { osd: 3,
                    status: already probed},
                  { osd: 5,
                    status: not queried},
                  { osd: 6,
                    status: already probed}],
  snip
 
 
  I tried force_create_pg but it gets stuck in creating. Any
 ideas on how to
  kickstart this node to create the correct numbers of
 replicas?
 
 
  PS: I have the following crush rule for the pool, which makes
 the replicas
  go to different hosts. 
  host1 has OSD 0,1,2
  host2 has OSD 3,4,5
  host3 has OSD 6,7,8
  Looking at it, the new replica should be going to OSD 0,1,2,
 but ceph is not
  creating it?
 
  rule different_host {
          ruleset 3
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type host
          step emit
  }
 
 
  Any help will be much appreciated. Cheers
  - Wai Peng
 
 
 
 
 
 
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+clean+degraded, but not creating new replicas

2013-06-03 Thread YIP Wai Peng
Hi Sage,

Thanks, I noticed that after re-reading the documentation.

I realized that osd.8 was not in host3. After adding osd.8 to host3, the
PGs are now in active+remapped.

# ceph pg 3.45 query

{ "state": "active+remapped",
  "epoch": 1374,
  "up": [
    4,
    8],
  "acting": [
    4,
    8,
    6],
<snip>

Still, nothing is happening. What can be wrong?
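
The kind of output worth watching while it sits in this state (plain status
commands, nothing exotic):

ceph pg dump_stuck unclean
ceph osd tree        # confirm osd.8 really sits under host3 in the crush map
ceph pg 3.45 query   # watch whether the recovery/backfill state changes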

- WP

On Tue, Jun 4, 2013 at 12:26 PM, Sage Weil s...@inktank.com wrote:

 On Tue, 4 Jun 2013, YIP Wai Peng wrote:
  Sorry, to set things in context, I had some other problems last weekend.
  Setting it to optimal tunables helped (although I am on the older
 kernel).
  Since it worked, I was inclined to believed that the tunables do work on
 the
  older kernel.
  That being said, I will upgrade the kernel to see if this issue goes
 away.

 The kernel version is only an issue if you are using the cephfs or rbd
 *client* from the kernel (e.g., rbd map ... or mount -t ceph ...).  (Ceph
 didn't appear upstream until 2.6.35 or thereabouts, and fixes are only
 backported as far as v3.4.)

 sage

 
  Regards,
  Wai Peng
 
 
  On Tue, Jun 4, 2013 at 12:01 PM, YIP Wai Peng yi...@comp.nus.edu.sg
 wrote:
Hi Sage,
  It is on optimal tunables already. However, I'm on kernel
  2.6.32-358.6.2.el6.x86_64. Will the tunables take effect or do I have
  to upgrade to something newer?
 
  - WP
 
 
  On Tue, Jun 4, 2013 at 11:58 AM, Sage Weil s...@inktank.com wrote:
On Tue, 4 Jun 2013, YIP Wai Peng wrote:
 Hi all,
 I'm running ceph on CentOS6 on 3 hosts, with 3 OSD each
(total 9 OSD).
 When I increased one of my pool rep size from 2 to 3,
just 6 PGs will get
 stuck in active+clean+degraded mode, but it doesn't
create new replicas.
 
  My first guess is that you do not have the newer crush tunables
  set and
  some placements are not quite right.  If you are prepared for
  some data
  migration, and are not using an older kernel client, try
 
   ceph osd crush tunables optimal
 
  sage
 
 
  
   One of the problematic PG has the following (snipped for
  brevity)
  
   { state: active+clean+degraded,
 epoch: 1329,
 up: [
   4,
   6],
 acting: [
   4,
   6],
   snip
 recovery_state: [
   { name: Started\/Primary\/Active,
 enter_time: 2013-06-04 01:10:30.092977,
 might_have_unfound: [
   { osd: 3,
 status: already probed},
   { osd: 5,
 status: not queried},
   { osd: 6,
 status: already probed}],
   snip
  
  
   I tried force_create_pg but it gets stuck in creating. Any
  ideas on how to
   kickstart this node to create the correct numbers of
  replicas?
  
  
   PS: I have the following crush rule for the pool, which makes
  the replicas
   go to different hosts.
   host1 has OSD 0,1,2
   host2 has OSD 3,4,5
   host3 has OSD 6,7,8
   Looking at it, the new replica should be going to OSD 0,1,2,
  but ceph is not
   creating it?
  
   rule different_host {
   ruleset 3
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type host
   step emit
   }
  
  
   Any help will be much appreciated. Cheers
   - Wai Peng
  
  
 
 
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com