[ceph-users] VMs freeze after slow requests
Hi, I'm trying to run a PostgreSQL cluster on VMs, each with a second disk mounted from Ceph (RBD via KVM). I started some writes (pgbench initialisation) on 8 VMs and the VMs froze. Ceph reported slow requests on one OSD. I restarted that OSD to clear the slow requests, but the VMs now hang permanently. Is this a normal situation after cluster problems?

Setup: 6 hosts x 26 OSDs, ceph version 0.61.2, kvm 1.2 (librbd version 0.61.2)

--
Regards, Dominik
Re: [ceph-users] VMs freeze after slow requests
On Sunday, June 2, 2013, Dominik Mostowiec wrote:

Hi, I'm trying to run a PostgreSQL cluster on VMs, each with a second disk mounted from Ceph (RBD via KVM). I started some writes (pgbench initialisation) on 8 VMs and the VMs froze. Ceph reported slow requests on one OSD. I restarted that OSD to clear the slow requests, but the VMs now hang permanently. Is this a normal situation after cluster problems?

Definitely not. Is your cluster reporting as healthy (what does ceph -s say)? Can you get anything off your hung VMs (like dmesg output)?

-Greg
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
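A minimal sketch of the kind of information Greg is asking for, gathered on a cluster node and inside one of the hung guests. The log path and guest device are assumptions based on a default install:

    # On a cluster node: overall state and detail of any slow/blocked requests
    ceph -s
    ceph health detail

    # Which OSD is logging slow requests (take the osd id from the health output)
    grep "slow request" /var/log/ceph/ceph-osd.*.log | tail -n 20

    # Inside a hung VM: look for hung-task or I/O error messages
    dmesg | tail -n 50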
Re: [ceph-users] VMs freeze after slow requests
On Monday, 3 June 2013 at 08:04 -0700, Gregory Farnum wrote:

On Sunday, June 2, 2013, Dominik Mostowiec wrote: Hi, I'm trying to run a PostgreSQL cluster on VMs, each with a second disk mounted from Ceph (RBD via KVM). I started some writes (pgbench initialisation) on 8 VMs and the VMs froze. Ceph reported slow requests on one OSD. I restarted that OSD to clear the slow requests, but the VMs now hang permanently. Is this a normal situation after cluster problems?

Definitely not. Is your cluster reporting as healthy (what does ceph -s say)? Can you get anything off your hung VMs (like dmesg output)? -Greg

Hi, I also see this with Xen and the kernel RBD client when the Ceph cluster was full: after some errors the block device switches to read-only mode, and I didn't find any way to fix that (mount -o remount,rw doesn't work). I had to reboot all the VMs. But since I didn't have to unmap/remap the RBD device, I don't think it's a Ceph/RBD problem; probably a Xen or Linux behaviour.

Olivier
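For the read-only symptom Olivier describes, a minimal sketch of what one might check before rebooting; the device and mount point are assumptions, and as he notes a plain remount may not be enough once the kernel has flagged errors:

    # Has the kernel marked the block device read-only?
    blockdev --getro /dev/xvdb

    # Try clearing the flag, then remounting read-write
    blockdev --setrw /dev/xvdb
    mount -o remount,rw /mnt/data

    # If the filesystem itself aborted (check dmesg), it usually needs
    # an unmount and fsck before writes will succeed again
    dmesg | grep -iE 'xvdb|read-only|aborted journal'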
Re: [ceph-users] Ceph killed by OS because of OOM under high load
On Mon, Jun 3, 2013 at 8:47 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

Hi, as my previous mail reported some weeks ago, we are suffering from OSD crashes, OSD flipping, system reboots and so on; all these stability issues really stop us from digging further into Ceph characterization. The good news is that we seem to have found the cause. I explain our experiments below.

Environment: we have 2 machines, one for the client and one for Ceph, connected via 10GbE. The client machine is very powerful, with 64 cores and 256GB RAM. The Ceph machine has 32 cores and 64GB RAM, but we limited the available RAM to 8GB via the grub configuration. 12 OSDs on top of 12 x 5400 RPM 1TB disks, with 4 x DC S3700 SSDs as journals. Both client and Ceph are v0.61.2. We run 12 rados bench instances on the client node as a stress test against the Ceph node, each instance with 256 concurrent operations.

Experiments and results:
1. Default ceph + default client: OK.
2. Tuned ceph + default client: FAIL. One OSD killed by the OS due to OOM, and all swap space exhausted. (Tuning: large queue ops / large queue bytes / no flusher / sync_flush = true.)
3. Tuned ceph WITHOUT large queue bytes + default client: OK.
4. Tuned ceph WITHOUT large queue bytes + aggressive client: FAIL. One OSD killed by OOM and one suicided because of a 150s op thread timeout. (Aggressive client: objecter_inflight_ops and objecter_inflight_op_bytes both set to 10x the default.)

Conclusions:
a. Under heavy load, some tuning will make Ceph unstable, especially the queue-bytes-related settings (deduced from 1+2+3).
b. Ceph doesn't control the length of the OSD queue. This is a critical issue: with an aggressive client or a lot of concurrent clients, the OSD queue becomes too long to fit in memory, and the OSD daemon gets killed (deduced from 3+4).
c. An observation of OSD daemon memory usage: if I use killall rados to kill all the rados bench instances, the ceph-osd daemon does not free the allocated memory and still retains very high usage (a freshly started ceph-osd uses ~0.5GB, under load it uses ~6GB, and after killing rados it still holds 5-6GB; restarting ceph clears this).

You don't have enough RAM for your OSDs. We really recommend 1-2GB per daemon; 600MB/daemon is dangerous. You might be able to make it work, but you'll definitely need to change the queue lengths and things. Speaking of which... yes, the OSDs do control their queue lengths, but it's not dynamic tuning, and by default it will let clients stack up 500MB of in-progress writes. With such wimpy systems you'll want to turn that down, probably alongside various journal and disk wait queues.

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
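A minimal ceph.conf sketch of the throttles Greg alludes to for a memory-constrained OSD node. The option names exist in Ceph of this vintage, but the values are illustrative assumptions rather than recommendations; osd client message size cap in particular is the roughly 500MB of in-flight client writes an OSD accepts by default:

    [osd]
        # cap on client write data an OSD will hold in flight (default ~500MB)
        osd client message size cap = 104857600
        # messenger-level throttle on dispatched message bytes
        ms dispatch throttle bytes = 52428800
        # shrink filestore and journal queues so less data sits in RAM
        filestore queue max bytes = 52428800
        journal queue max bytes = 52428800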
Re: [ceph-users] MDS has been repeatedly laggy or crashed
On Sat, Jun 1, 2013 at 7:50 PM, MinhTien MinhTien tientienminh080...@gmail.com wrote:

Hi all. I have 3 servers (running ceph 0.56.6):
- 1 server used for mon + mds.0
- 1 server running an OSD daemon (RAID 6, 44TB = osd.0) + mds.1
- 1 server running an OSD daemon (RAID 6, 44TB = osd.1) + mds.2

While the Ceph system is running, the MDS repeatedly goes laggy or crashes, about 2 times per minute, and then the MDS reconnects and comes back to active.

Do you have logs from the MDS that actually crashed? What you have here doesn't really tell us anything, except that apparently the disk state is okay (that's good).

I set max mds active = 1: this error is generated. I set max mds active = 2: this error is generated.

Hmm, you don't really want to fiddle with this setting. Using multiple active MDSes is much less stable than one. Why were you fiddling with it to begin with?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
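A brief sketch of inspecting MDS state and going back to a single active MDS, along the lines Greg suggests; the commands are from the 0.56/0.61-era CLI, and the rank passed to mds stop is an assumption:

    # Current MDS map: how many are active, laggy, standby?
    ceph mds stat
    ceph mds dump | head -n 20

    # Return to one active MDS; spare daemons become standbys
    ceph mds set_max_mds 1
    # If a second rank is already active, it may also need to be stopped:
    ceph mds stop 1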
Re: [ceph-users] ceph-deploy
Actually, as I said, I unmounted them first, zapped the disk, then used osd create. For you, that might look like:

    sudo umount /dev/sda3
    ceph-deploy disk zap ceph0:sda3 ceph1:sda3 ceph2:sda3
    ceph-deploy osd create ceph0:sda3 ceph1:sda3 ceph2:sda3

I was referring to the entire disk in my deployment, and I wasn't using partitions on the same disk, so ceph-deploy created the data and journal partitions for me. If you are running multiple OSDs on the same disk (not recommended, except for evaluation), you'd want to use the following procedure:

On Sat, Jun 1, 2013 at 7:57 AM, Dewan Shamsul Alam dewan.sham...@gmail.com wrote:

Hi John, I have a feeling that I am missing something. Previously, when I succeeded with bobtail using mkcephfs, I mounted the /dev/sdb1 partitions. There is nothing mentioned in the blog about that, though. Say I have 3 nodes: ceph201, ceph202 and ceph203. Each has a /dev/sdb1 partition formatted as xfs. Do I need to mount them in a particular directory before running the command, or will ceph-deploy take care of it?

On Thu, May 30, 2013 at 8:17 PM, John Wilkins john.wilk...@inktank.com wrote:

Dewan, I encountered this too. I just did umount and re-ran the command, and it worked for me. I probably need to add a troubleshooting section for ceph-deploy.

On Fri, May 24, 2013 at 4:00 PM, John Wilkins john.wilk...@inktank.com wrote:

ceph-deploy does have the ability to push the client keyrings. I haven't encountered this as a problem. However, I have created a monitor and not seen it return a keyring; in other words, it failed but didn't give me a warning message, so I just re-executed creating the monitor. The directory from which you execute ceph-deploy mon create should have a ceph.client.admin.keyring too. If it doesn't, you might have had a problem creating the monitor. I don't believe you have to push the ceph.client.admin.keyring to all the nodes, so it shouldn't be barking back unless you failed to create the monitor, or gatherkeys failed.

On Thu, May 23, 2013 at 9:09 PM, Dewan Shamsul Alam dewan.sham...@gmail.com wrote:

I just found that #ceph-deploy gatherkeys ceph0 ceph1 ceph2 works only if I have bobtail; cuttlefish can't find the
ceph.client.admin.keyring, and then when I try this on bobtail, it says:

    root@cephdeploy:~/12.04# ceph-deploy osd create ceph0:/dev/sda3 ceph1:/dev/sda3 ceph2:/dev/sda3
    ceph-disk: Error: Device is mounted: /dev/sda3
    Traceback (most recent call last):
      File "/usr/bin/ceph-deploy", line 22, in <module>
        main()
      File "/usr/lib/pymodules/python2.7/ceph_deploy/cli.py", line 112, in main
        return args.func(args)
      File "/usr/lib/pymodules/python2.7/ceph_deploy/osd.py", line 293, in osd
        prepare(args, cfg, activate_prepared_disk=True)
      File "/usr/lib/pymodules/python2.7/ceph_deploy/osd.py", line 177, in prepare
        dmcrypt_dir=args.dmcrypt_key_dir,
      File "/usr/lib/python2.7/dist-packages/pushy/protocol/proxy.py", line 255, in <lambda>
        (conn.operator(type_, self, args, kwargs))
      File "/usr/lib/python2.7/dist-packages/pushy/protocol/connection.py", line 66, in operator
        return self.send_request(type_, (object, args, kwargs))
      File "/usr/lib/python2.7/dist-packages/pushy/protocol/baseconnection.py", line 323, in send_request
        return self.__handle(m)
      File "/usr/lib/python2.7/dist-packages/pushy/protocol/baseconnection.py", line 639, in __handle
        raise e
    pushy.protocol.proxy.ExceptionProxy: Command '['ceph-disk-prepare', '--', '/dev/sda3']' returned non-zero exit status 1
    root@cephdeploy:~/12.04#

On Thu, May 23, 2013 at 10:49 PM, Dewan Shamsul Alam dewan.sham...@gmail.com wrote:

Hi, I tried ceph-deploy all day. I found that it has python-setuptools as a dependency. I knew about python-pushy, but is there any other dependency that I'm missing? The problems I'm getting are as follows:

    #ceph-deploy gatherkeys ceph0 ceph1 ceph2

returns the following error:

    Unable to find /etc/ceph/ceph.client.admin.keyring on ['ceph0', 'ceph1', 'ceph2']

Once I get past this, I don't know why it only works sometimes. I have been following the exact steps mentioned in the blog. Then, when I try to do

    ceph-deploy osd create ceph0:/dev/sda3 ceph1:/dev/sda3 ceph2:/dev/sda3

it gets stuck. I'm using Ubuntu 13.04 for ceph-deploy and 12.04 for the ceph nodes. I just need to get cuttlefish working and am willing to change the OS if required. Please help. :)

Best Regards,
Dewan Shamsul Alam

--
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
http://inktank.com

--
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
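For reference, a sketch of the overall cuttlefish-era ceph-deploy flow this thread is working through, using the host names and partition from the thread; run from the admin node, and note that disk zap destroys data:

    ceph-deploy new ceph0 ceph1 ceph2          # writes ceph.conf and the initial mon map
    ceph-deploy install ceph0 ceph1 ceph2      # installs packages on the nodes
    ceph-deploy mon create ceph0 ceph1 ceph2   # monitors must form a quorum first
    ceph-deploy gatherkeys ceph0               # fetches ceph.client.admin.keyring and friends
    ceph-deploy disk zap ceph0:sda3 ceph1:sda3 ceph2:sda3
    ceph-deploy osd create ceph0:sda3 ceph1:sda3 ceph2:sda3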
Re: [ceph-users] ceph-deploy
Sorry, hit send inadvertently:
http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only

On Mon, Jun 3, 2013 at 1:00 PM, John Wilkins john.wilk...@inktank.com wrote:

<snip of the previous message, quoted in full>
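The "multiple OSDs on the OS disk (demo only)" procedure John links to boils down to directory-backed OSDs rather than whole-disk ones; a sketch, with the directory paths being assumptions:

    # Create the backing directories on each node first
    mkdir -p /srv/osd0           # on ceph0; /srv/osd1 on ceph1, /srv/osd2 on ceph2

    # Prepare and activate directory-backed OSDs (evaluation only, not for production)
    ceph-deploy osd prepare ceph0:/srv/osd0 ceph1:/srv/osd1 ceph2:/srv/osd2
    ceph-deploy osd activate ceph0:/srv/osd0 ceph1:/srv/osd1 ceph2:/srv/osd2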
Re: [ceph-users] replacing an OSD or crush map sensitivity
My $0.02: you really don't need to wait for HEALTH_OK between your recovery steps; just go ahead. Every time a new map is generated and broadcast, the old map and in-progress recovery will be cancelled.

Sent from my iPhone

On 2 Jun 2013, at 11:30, Nigel Williams nigel.d.willi...@gmail.com wrote:

Could I have a critique of this approach please, as to how I could have done it better, or whether what I experienced simply reflects work still to be done. This is with Ceph 0.61.2 on a quite slow test cluster (logs shared with OSDs, no separate journals, using CephFS).

I knocked the power cord out from a storage node, taking down 4 of the hosted OSDs; all but one came back OK. That is one OSD out of a total of 12, so 1/12 of the storage. Losing an OSD put the cluster into recovery, so all good. The next step was how to get the missing (downed) OSD back online. The OSD was xfs-based, so I had to throw away the xfs log to get it to mount. Having done this and re-mounted it, Ceph then started throwing issue #4855 (I added dmesg output and logs to that issue if it helps; I wonder if throwing away the xfs log caused an internal OSD inconsistency, and whether that causes issue #4855?).

Given that I could not recover this OSD as far as Ceph is concerned, I decided to delete and rebuild it. Several hours later, the cluster was back to HEALTH_OK. I proceeded to remove and re-add the bad OSD, following the doc suggestions. The problem is that each change caused a slight change in the crush map, resulting in the cluster going back into recovery and adding several hours of waiting per change. I chose to wait until the cluster was back to HEALTH_OK before doing the next step. Overall it has taken a few days to finally get a single OSD back into the cluster.

At one point during recovery the full threshold was triggered on a single OSD, causing the recovery to stop; doing ceph pg set_full_ratio 0.98 did not help. I was not planning to add data to the cluster while doing recovery operations, and I did not understand the suggestion that PGs could be deleted to make space on a full OSD, so I expect raising the threshold was the best option, but it had no (immediate) effect.

I am now back to having all 12 OSDs in, with hopefully the final recovery under way while it re-balances the OSDs. I note I am still getting the full OSD warning, but I expect it to disappear soon now that the 12th OSD is back online. During this recovery the percentage degraded has been a little confusing: while the 12th OSD was offline the percentages were around 15-20% IIRC, but now I see the percentage is 35% and slowly dropping. I'm not sure I understand the ratios, and why they are so high with a single missing OSD.

A few documentation errors caused confusion too. This page still contains errors in the steps to create a new OSD (manually): http://eu.ceph.com/docs/wip-3060/cluster-ops/add-or-rm-osds/#adding-an-osd-manual ("ceph osd create {osd-num}" should be just "ceph osd create"), and on this page: http://eu.ceph.com/docs/wip-3060/cluster-ops/crush-map/#addosd I had to add host= to get the command accepted.

Suggestions and questions:
1. Is there a way to get documentation pages fixed, or at least health warnings put on them ("this page badly needs updating since it is wrong/misleading")?
2. We need a small set of definitive, succinct recipes that provide steps to recover from common failures, with a narrative around what to expect at each step ("your cluster will be in recovery here...").
3. Some commands throw erroneous errors that are actually benign: ceph-osd -i 10 --mkfs --mkkey complains about failures that are expected because the OSD is initially empty.
4. An easier way to capture the state of the cluster for analysis. I don't feel confident that, when asked for logs, I am giving the most useful snippets or the complete story. It seems we need a tool that can gather all of this into a neat bundle for later dissection or forensics.
5. Is there a more straightforward (faster) way of getting an OSD back online? It almost seems worth having a standby OSD ready to step in and assume duties (a hot spare?).
6. Is there a way to make the crush map less sensitive to changes during recovery operations? I would have liked to stall/slow recovery while I replaced the OSD, then let it run at full speed.

Excuses: I'd be happy to action suggestions, but my current level of Ceph understanding is still too limited for effort on my part to be productive; I am prodding the community to see if there is consensus on the need.
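A sketch of the manual remove-and-re-add sequence Nigel describes, reflecting the corrections he notes (ceph osd create takes no id argument, and the crush command wants host=). The OSD id, weight, paths and host name are illustrative assumptions, and the exact crush add/set arguments are worth checking against your release's docs:

    # Remove the dead OSD (id 10 here) from the cluster
    ceph osd out 10
    service ceph stop osd.10            # or: stop ceph-osd id=10 on upstart systems
    ceph osd crush remove osd.10
    ceph auth del osd.10
    ceph osd rm 10

    # Re-add it: allocate an id, initialise the data dir, register the key, place it in CRUSH
    ceph osd create                     # prints the newly allocated id, e.g. 10
    ceph-osd -i 10 --mkfs --mkkey       # the errors it prints on an empty OSD are benign
    ceph auth add osd.10 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-10/keyring
    ceph osd crush add osd.10 1.0 root=default host=storage-node-3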
Re: [ceph-users] CentOS + qemu-kvm rbd support update
Hi Andrei,

Have you tried the patched ones at https://objects.dreamhost.com/rpms/qemu/qemu-kvm-0.12.1.2-2.355.el6.2.x86_64.rpm and https://objects.dreamhost.com/rpms/qemu/qemu-img-0.12.1.2-2.355.el6.2.x86_64.rpm? I got the links off the IRC chat; I'm using them now.

- WP

On Sun, Jun 2, 2013 at 8:41 AM, Andrei Mikhailovsky and...@arhont.com wrote:

Hello guys, I was wondering if there is any news on the CentOS 6 qemu-kvm packages with rbd support? I am very keen to try them out. Thanks, Andrei
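A sketch of installing the patched builds and confirming the result actually speaks RBD; the pool and image names are assumptions:

    # Install or upgrade to the patched packages
    rpm -Uvh qemu-img-0.12.1.2-2.355.el6.2.x86_64.rpm qemu-kvm-0.12.1.2-2.355.el6.2.x86_64.rpm

    # If rbd shows up as a supported format, the patch is in
    qemu-img --help | grep -i rbd

    # Exercise it end-to-end against the cluster
    qemu-img create -f rbd rbd:rbd/test-image 1G
    qemu-img info rbd:rbd/test-image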
Re: [ceph-users] qemu-1.4.2 rbd-fixed ubuntu packages
Thanks for your clarification. I don't have much in-depth knowledge of libvirt, although I believe OpenStack does use it for scheduling nova compute jobs (initiating VM instances) and supporting live migration, both of which work properly in our Grizzly environment. I will keep an eye on this and report back if I see any similar issues.

--weiguo

Date: Mon, 3 Jun 2013 08:10:51 +0200
From: wolfgang.hennerbich...@risc-software.at
To: ws...@hotmail.com
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] qemu-1.4.2 rbd-fixed ubuntu packages

On Wed, May 29, 2013 at 04:16:14PM +0200, w sun wrote:

Hi Wolfgang, can you elaborate on the issue with 1.5 and libvirt? I wonder if it will impact usage with Grizzly. I did a quick compile of 1.5 with RBD support enabled, and so far it seems to be OK for OpenStack after a few simple tests, but I definitely want to be cautious if there is a known integration issue with 1.5. Thanks. --weiguo

I basically couldn't make the VM boot with libvirt. Libvirt complained about a missing monitor command (not a Ceph monitor, but the kvm monitor file or something). I didn't want to start upgrading libvirt too, so I stepped back to 1.4.2.

Wolfgang
--
http://www.wogri.com
[ceph-users] PG active+clean+degraded, but not creating new replicas
Hi all, I'm running Ceph on CentOS 6 on 3 hosts, with 3 OSDs each (9 OSDs total). When I increased one of my pools' rep size from 2 to 3, 6 PGs got stuck in active+clean+degraded mode, but Ceph doesn't create new replicas.

One of the problematic PGs shows the following (snipped for brevity):

    { "state": "active+clean+degraded",
      "epoch": 1329,
      "up": [4, 6],
      "acting": [4, 6],
      <snip>
      "recovery_state": [
            { "name": "Started\/Primary\/Active",
              "enter_time": "2013-06-04 01:10:30.092977",
              "might_have_unfound": [
                    { "osd": 3, "status": "already probed"},
                    { "osd": 5, "status": "not queried"},
                    { "osd": 6, "status": "already probed"}],
      <snip>

I tried force_create_pg, but it gets stuck in creating. Any ideas on how to kickstart this to create the correct number of replicas?

PS: I have the following crush rule for the pool, which makes the replicas go to different hosts. host1 has OSDs 0,1,2; host2 has OSDs 3,4,5; host3 has OSDs 6,7,8. Looking at it, the new replica should be going to OSD 0, 1 or 2, but Ceph is not creating it.

    rule different_host {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

Any help will be much appreciated. Cheers
- Wai Peng
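A sketch of commands for narrowing down why a PG stays degraded after a rep size change; the PG id matches the one quoted later in the thread, the rest is generic:

    # Which PGs are stuck, and where does CRUSH currently map them?
    ceph pg dump_stuck unclean
    ceph pg 3.45 query
    ceph osd tree

    # Decompile the CRUSH map to check the rule and the host/OSD hierarchy
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt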
Re: [ceph-users] replacing an OSD or crush map sensitivity
On 4/06/2013 9:16 AM, Chen, Xiaoxi wrote:

My $0.02: you really don't need to wait for HEALTH_OK between your recovery steps; just go ahead. Every time a new map is generated and broadcast, the old map and in-progress recovery will be cancelled.

Thanks Xiaoxi, that is helpful to know. It seems to me that there might be a failure mode (or race condition?) here though, as the cluster is now struggling to recover: the replacement OSD caused the cluster to go into backfill_toofull. The failure sequence might be:

1. From HEALTH_OK, crash an OSD
2. Wait for recovery
3. Remove the OSD using the usual procedures
4. Wait for recovery
5. Add the OSD back using the usual procedures
6. Wait for recovery
7. Cluster is unable to recover due to toofull conditions

Perhaps this is a needed test case: round-trip a cluster through a known failure/recovery scenario. Note this is a simplistically configured test cluster with CephFS in the mix and about 2.5 million files.

Something else I noticed: I restarted the cluster (and set the leveldb compact option, since I'd run out of space on the roots) and now I see it is again making progress on the backfill. It seems odd that the cluster pauses but a restart clears the pause; is that by design?
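A sketch of the knobs involved when recovery stalls on toofull; the ratios shown are illustrative assumptions, and raising them is only a stop-gap while space is freed or rebalanced:

    # Cluster-wide full / near-full ratios (what the earlier set_full_ratio command changes)
    ceph pg set_full_ratio 0.98
    ceph pg set_nearfull_ratio 0.90

    # Backfill has its own per-OSD threshold (osd_backfill_full_ratio, default 0.85)
    ceph tell osd.\* injectargs '--osd-backfill-full-ratio 0.92'

    # Check what is actually using the space
    rados df
    df -h /var/lib/ceph/osd/*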
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
On Tue, 4 Jun 2013, YIP Wai Peng wrote:

Hi all, I'm running Ceph on CentOS 6 on 3 hosts, with 3 OSDs each (9 OSDs total). When I increased one of my pools' rep size from 2 to 3, 6 PGs got stuck in active+clean+degraded mode, but Ceph doesn't create new replicas.

My first guess is that you do not have the newer crush tunables set and some placements are not quite right. If you are prepared for some data migration, and are not using an older kernel client, try

    ceph osd crush tunables optimal

sage

<snip of the rest of the quoted message>
Re: [ceph-users] replacing an OSD or crush map sensitivity
On Tue, 4 Jun 2013, Nigel Williams wrote:

<snip>

Something else I noticed: I restarted the cluster (and set the leveldb compact option, since I'd run out of space on the roots) and now I see it is again making progress on the backfill. It seems odd that the cluster pauses but a restart clears the pause; is that by design?

Does the monitor data directory share a disk with an OSD? If so, that makes sense: compaction freed enough space to drop below the threshold...

sage
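A sketch of checking for, and mitigating, the shared mon/OSD disk situation Sage describes; the paths follow the default layout, and the compaction option appeared in the cuttlefish series, so verify the exact spelling against your release:

    # Is the monitor's leveldb store sharing a filesystem with an OSD?
    du -sh /var/lib/ceph/mon/ceph-*/store.db
    df -h /var/lib/ceph/mon /var/lib/ceph/osd

    # Have the monitor compact its store on restart (ceph.conf):
    #   [mon]
    #   mon compact on start = true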
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
Hi Sage,

It is on optimal tunables already. However, I'm on kernel 2.6.32-358.6.2.el6.x86_64. Will the tunables take effect, or do I have to upgrade to something newer?

- WP

On Tue, Jun 4, 2013 at 11:58 AM, Sage Weil s...@inktank.com wrote:

<snip of the quoted thread>
Re: [ceph-users] replacing an OSD or crush map sensitivity
On Tue, Jun 4, 2013 at 1:59 PM, Sage Weil s...@inktank.com wrote:

On Tue, 4 Jun 2013, Nigel Williams wrote: Something else I noticed: ...

Does the monitor data directory share a disk with an OSD? If so, that makes sense: compaction freed enough space to drop below the threshold...

Of course! That is exactly it, thanks. Scratch that last observation; red herring.
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
On Mon, Jun 03, 2013 at 08:58:00PM -0700, Sage Weil wrote:

My first guess is that you do not have the newer crush tunables set and some placements are not quite right. If you are prepared for some data migration, and are not using an older kernel client, try ceph osd crush tunables optimal

sage

One thing that I'm not quite sure about: in the documentation we learn that "The ceph-osd and ceph-mon daemons will start requiring the feature bits of new connections as soon as they get the updated map. However, already-connected clients are effectively grandfathered in, and will misbehave if they do not support the new feature." So: am I in danger if I set this to optimal on a production bobtail cluster with qemu-rbd being the only client around?

Wolfgang
--
http://www.wogri.com
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
Sorry, to set things in context: I had some other problems last weekend, and setting optimal tunables helped (although I am on the older kernel). Since it worked, I was inclined to believe that the tunables do work on the older kernel. That being said, I will upgrade the kernel to see if this issue goes away.

Regards,
Wai Peng

On Tue, Jun 4, 2013 at 12:01 PM, YIP Wai Peng yi...@comp.nus.edu.sg wrote:

<snip of the quoted thread>
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
On Tue, 4 Jun 2013, Wolfgang Hennerbichler wrote:

On Mon, Jun 03, 2013 at 08:58:00PM -0700, Sage Weil wrote: My first guess is that you do not have the newer crush tunables set and some placements are not quite right. If you are prepared for some data migration, and are not using an older kernel client, try ceph osd crush tunables optimal

One thing that I'm not quite sure about: in the documentation we learn that "The ceph-osd and ceph-mon daemons will start requiring the feature bits of new connections as soon as they get the updated map. However, already-connected clients are effectively grandfathered in, and will misbehave if they do not support the new feature." So: am I in danger if I set this to optimal on a production bobtail cluster with qemu-rbd being the only client around?

The tunables were added in v0.55 (just prior to bobtail), so you should be in good shape.

sage
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
On Tue, 4 Jun 2013, YIP Wai Peng wrote:

Sorry, to set things in context: I had some other problems last weekend, and setting optimal tunables helped (although I am on the older kernel). Since it worked, I was inclined to believe that the tunables do work on the older kernel. That being said, I will upgrade the kernel to see if this issue goes away.

The kernel version is only an issue if you are using the cephfs or rbd *client* from the kernel (e.g., rbd map ... or mount -t ceph ...). (Ceph didn't appear upstream until 2.6.35 or thereabouts, and fixes are only backported as far as v3.4.)

sage

<snip of the rest of the quoted thread>
Re: [ceph-users] PG active+clean+degraded, but not creating new replicas
Hi Sage,

Thanks. I noticed it after re-reading the documentation: I realized that osd.8 was not under host3. After adding osd.8 to host3, the PGs are now active+remapped:

    # ceph pg 3.45 query
    { "state": "active+remapped",
      "epoch": 1374,
      "up": [4, 8],
      "acting": [4, 8, 6],
      <snip>

Still, nothing is happening. What can be wrong?

- WP

On Tue, Jun 4, 2013 at 12:26 PM, Sage Weil s...@inktank.com wrote:

<snip of the quoted thread>
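A sketch of double-checking where a rule actually places replicas, which would have flagged the misplaced osd.8, plus the kind of command used to move it; crushtool --test flags vary a little between versions, so check crushtool --help, and the weight is an assumption:

    # Extract and decompile the live CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Simulate placements for rule 3 with 3 replicas
    crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-statistics

    # Move an OSD under the correct host in the live map
    ceph osd crush set osd.8 1.0 root=default host=host3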