Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
Hi Craig,

On Fri, Sep 19, 2014 at 2:49 AM, Craig Lewis cle...@centraldesktop.com wrote:
> No, removing the snapshots didn't solve my problem. I eventually traced
> this problem to XFS deadlocks caused by
>
>     [osd]
>     osd mkfs options xfs: -l size=1024m -n size=64k -i size=2048 -s size=4096
>
> Changing to just -s size=4096, and reformatting all OSDs, solved this
> problem.
>
> Since then, I ran into http://tracker.ceph.com/issues/5699. Snapshots are
> off until I've deployed Firefly.

Thanks for responding. We've tracked our issue down to
http://tracker.ceph.com/issues/9487 in the interim. :)

Cheers,
Florian
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
Excellent find.

On Fri, Sep 19, 2014 at 7:11 AM, Florian Haas flor...@hastexo.com wrote:
> Hi Craig,
>
> On Fri, Sep 19, 2014 at 2:49 AM, Craig Lewis cle...@centraldesktop.com wrote:
>> No, removing the snapshots didn't solve my problem. I eventually traced
>> this problem to XFS deadlocks caused by
>>
>>     [osd]
>>     osd mkfs options xfs: -l size=1024m -n size=64k -i size=2048 -s size=4096
>>
>> Changing to just -s size=4096, and reformatting all OSDs, solved this
>> problem.
>>
>> Since then, I ran into http://tracker.ceph.com/issues/5699. Snapshots
>> are off until I've deployed Firefly.
>
> Thanks for responding. We've tracked our issue down to
> http://tracker.ceph.com/issues/9487 in the interim. :)
>
> Cheers,
> Florian
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
No, removing the snapshots didn't solve my problem. I eventually traced
this problem to XFS deadlocks caused by

    [osd]
    osd mkfs options xfs: -l size=1024m -n size=64k -i size=2048 -s size=4096

Changing to just -s size=4096, and reformatting all OSDs, solved this
problem.

Since then, I ran into http://tracker.ceph.com/issues/5699. Snapshots are
off until I've deployed Firefly.

On Wed, Sep 17, 2014 at 8:09 AM, Florian Haas flor...@hastexo.com wrote:
> Hi Craig,
>
> just dug this up in the list archives.
>
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote:
>> In the interest of removing variables, I removed all snapshots on all
>> pools, then restarted all ceph daemons at the same time. This brought
>> up osd.8 as well.
>
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
>
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything in
> the description sounds familiar?
>
> You mentioned in a later message in the same thread that you would keep
> your snapshot script running and repeat the experiment. Did the
> situation change in any way after that? Did the issue come back? Or did
> you just stop using snapshots altogether?
>
> Cheers,
> Florian
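For anyone wanting to replicate Craig's change, here is a minimal ceph.conf
sketch of the before/after he describes. The "osd mkfs options xfs = ..."
spelling is the standard ceph.conf form and is an assumption on my part
(Craig's message uses a colon); note that mkfs options only take effect when
an OSD filesystem is (re)formatted.

    [osd]
    ; before -- the option string implicated in the XFS deadlocks above
    ; osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

    ; after -- only the sector-size override is kept; all OSDs were reformatted
    osd mkfs options xfs = -s size=4096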
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
Hi Craig,

just dug this up in the list archives.

On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote:
> In the interest of removing variables, I removed all snapshots on all
> pools, then restarted all ceph daemons at the same time. This brought up
> osd.8 as well.

So just to summarize this: your 100% CPU problem at the time went away
after you removed all snapshots, and the actual cause of the issue was
never found?

I am seeing a similar issue now, and have filed
http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost again.
Can you take a look at that issue and let me know if anything in the
description sounds familiar?

You mentioned in a later message in the same thread that you would keep
your snapshot script running and repeat the experiment. Did the situation
change in any way after that? Did the issue come back? Or did you just stop
using snapshots altogether?

Cheers,
Florian
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
Hi Florian,

On 17 Sep 2014, at 17:09, Florian Haas flor...@hastexo.com wrote:
> Hi Craig,
>
> just dug this up in the list archives.
>
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote:
>> In the interest of removing variables, I removed all snapshots on all
>> pools, then restarted all ceph daemons at the same time. This brought
>> up osd.8 as well.
>
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
>
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything in
> the description sounds familiar?

Could your ticket be related to the snap trimming issue I've finally
narrowed down in the past couple days?
http://tracker.ceph.com/issues/9487

Bump up debug_osd to 20 then check the log during one of your incidents. If
it is busy logging the snap_trimmer messages, then it's the same issue.

(The issue is that rbd pools have many purged_snaps, but sometimes after
backfilling a PG the purged_snaps list is lost and thus the snap trimmer
becomes very busy whilst re-trimming thousands of snaps. During that time
(a few minutes on my cluster) the OSD is blocked.)

Cheers,
Dan

> You mentioned in a later message in the same thread that you would keep
> your snapshot script running and repeat the experiment. Did the
> situation change in any way after that? Did the issue come back? Or did
> you just stop using snapshots altogether?
>
> Cheers,
> Florian
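A rough sketch of the check Dan suggests, assuming default log paths; the
OSD id is a placeholder and the exact log wording can differ between Ceph
versions.

    # raise OSD logging at runtime, no restart needed (osd.12 is a placeholder id)
    ceph tell osd.12 injectargs '--debug_osd 20'

    # during an incident, look for snap trimming activity in that OSD's log
    grep -c snap_trimmer /var/log/ceph/ceph-osd.12.log

    # drop back to a quieter level once done
    ceph tell osd.12 injectargs '--debug_osd 0/5'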
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
> Hi Florian,
>
> On 17 Sep 2014, at 17:09, Florian Haas flor...@hastexo.com wrote:
>> Hi Craig,
>>
>> just dug this up in the list archives.
>>
>> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote:
>>> In the interest of removing variables, I removed all snapshots on all
>>> pools, then restarted all ceph daemons at the same time. This brought
>>> up osd.8 as well.
>>
>> So just to summarize this: your 100% CPU problem at the time went away
>> after you removed all snapshots, and the actual cause of the issue was
>> never found?
>>
>> I am seeing a similar issue now, and have filed
>> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
>> again. Can you take a look at that issue and let me know if anything in
>> the description sounds familiar?
>
> Could your ticket be related to the snap trimming issue I've finally
> narrowed down in the past couple days?
> http://tracker.ceph.com/issues/9487
>
> Bump up debug_osd to 20 then check the log during one of your incidents.
> If it is busy logging the snap_trimmer messages, then it's the same
> issue. (The issue is that rbd pools have many purged_snaps, but
> sometimes after backfilling a PG the purged_snaps list is lost and thus
> the snap trimmer becomes very busy whilst re-trimming thousands of
> snaps. During that time (a few minutes on my cluster) the OSD is
> blocked.)

That sounds promising, thank you! debug_osd=10 should actually be
sufficient as those snap_trim messages get logged at that level. :)

Do I understand your issue report correctly in that you have found setting
osd_snap_trim_sleep to be ineffective, because it's being applied when
iterating from PG to PG, rather than from snap to snap? If so, then I'm
guessing that that can hardly be intentional...

Cheers,
Florian
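For context, the option Florian refers to can be set in ceph.conf or
injected at runtime. The value below is only an illustrative example (the
default is 0, i.e. no sleep); per the confirmation in the next message, on
the affected versions the sleep is applied once per PG rather than per
trimmed snap.

    [osd]
    ; seconds to sleep in the snap trim work queue; 0 disables the sleep
    osd snap trim sleep = 0.05

    # or at runtime, across all OSDs
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05'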
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
Hi,

(Sorry for top posting, mobile now).

That's exactly what I observe -- one sleep per PG. The problem is that the
sleep can't simply be moved, since AFAICT the whole PG is locked for the
duration of the trimmer. So the options I proposed are to limit the number
of snaps trimmed per call to e.g. 16, or to fix the loss of purged_snaps
after backfilling. Actually, probably both of those are needed. But a real
dev would know better.

Cheers,
Dan

From: Florian Haas flor...@hastexo.com
Sent: Sep 17, 2014 5:33 PM
To: Dan Van Der Ster
Cc: Craig Lewis cle...@centraldesktop.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
> Hi Florian,
>
> On 17 Sep 2014, at 17:09, Florian Haas flor...@hastexo.com wrote:
>> Hi Craig,
>>
>> just dug this up in the list archives.
>>
>> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote:
>>> In the interest of removing variables, I removed all snapshots on all
>>> pools, then restarted all ceph daemons at the same time. This brought
>>> up osd.8 as well.
>>
>> So just to summarize this: your 100% CPU problem at the time went away
>> after you removed all snapshots, and the actual cause of the issue was
>> never found?
>>
>> I am seeing a similar issue now, and have filed
>> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
>> again. Can you take a look at that issue and let me know if anything in
>> the description sounds familiar?
>
> Could your ticket be related to the snap trimming issue I've finally
> narrowed down in the past couple days?
> http://tracker.ceph.com/issues/9487
>
> Bump up debug_osd to 20 then check the log during one of your incidents.
> If it is busy logging the snap_trimmer messages, then it's the same
> issue. (The issue is that rbd pools have many purged_snaps, but
> sometimes after backfilling a PG the purged_snaps list is lost and thus
> the snap trimmer becomes very busy whilst re-trimming thousands of
> snaps. During that time (a few minutes on my cluster) the OSD is
> blocked.)

That sounds promising, thank you! debug_osd=10 should actually be
sufficient as those snap_trim messages get logged at that level. :)

Do I understand your issue report correctly in that you have found setting
osd_snap_trim_sleep to be ineffective, because it's being applied when
iterating from PG to PG, rather than from snap to snap? If so, then I'm
guessing that that can hardly be intentional...

Cheers,
Florian
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
> From: Florian Haas flor...@hastexo.com
> Sent: Sep 17, 2014 5:33 PM
> To: Dan Van Der Ster
> Cc: Craig Lewis cle...@centraldesktop.com; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
>
> On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
>> Hi Florian,
>>
>> On 17 Sep 2014, at 17:09, Florian Haas flor...@hastexo.com wrote:
>>> Hi Craig,
>>>
>>> just dug this up in the list archives.
>>>
>>> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote:
>>>> In the interest of removing variables, I removed all snapshots on
>>>> all pools, then restarted all ceph daemons at the same time. This
>>>> brought up osd.8 as well.
>>>
>>> So just to summarize this: your 100% CPU problem at the time went
>>> away after you removed all snapshots, and the actual cause of the
>>> issue was never found?
>>>
>>> I am seeing a similar issue now, and have filed
>>> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
>>> again. Can you take a look at that issue and let me know if anything
>>> in the description sounds familiar?
>>
>> Could your ticket be related to the snap trimming issue I've finally
>> narrowed down in the past couple days?
>> http://tracker.ceph.com/issues/9487
>>
>> Bump up debug_osd to 20 then check the log during one of your
>> incidents. If it is busy logging the snap_trimmer messages, then it's
>> the same issue. (The issue is that rbd pools have many purged_snaps,
>> but sometimes after backfilling a PG the purged_snaps list is lost and
>> thus the snap trimmer becomes very busy whilst re-trimming thousands
>> of snaps. During that time (a few minutes on my cluster) the OSD is
>> blocked.)
>
> That sounds promising, thank you! debug_osd=10 should actually be
> sufficient as those snap_trim messages get logged at that level. :)
>
> Do I understand your issue report correctly in that you have found
> setting osd_snap_trim_sleep to be ineffective, because it's being
> applied when iterating from PG to PG, rather than from snap to snap? If
> so, then I'm guessing that that can hardly be intentional...
>
> Cheers,
> Florian

> Hi,
>
> (Sorry for top posting, mobile now).

I've taken the liberty to reformat. :)

> That's exactly what I observe -- one sleep per PG. The problem is that
> the sleep can't simply be moved, since AFAICT the whole PG is locked for
> the duration of the trimmer. So the options I proposed are to limit the
> number of snaps trimmed per call to e.g. 16, or to fix the loss of
> purged_snaps after backfilling. Actually, probably both of those are
> needed. But a real dev would know better.

Okay. Certainly worth a try. Thanks again! I'll let you know when I know
more.

Cheers,
Florian
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
On 3/27/14 18:04, Craig Lewis wrote:
> I'm trying to use strace on osd.4:
>
>     strace -tt -f -ff -o ./ceph-osd.4.strace -x /usr/bin/ceph-osd --cluster=ceph -i 4 -f
>
> So far, strace is running, and the process isn't hung. After I ran this,
> the cluster finally finished backfilling the last of the PGs (all on
> osd.4). Since the cluster is healthy again, I killed the strace, and
> started the daemon normally (start ceph-osd id=4). Things seem fine now.
> I'm going to let it scrub and deepscrub overnight. I'll restart
> radosgw-agent tomorrow.

This seems to have resolved the issue. The cluster completed recovery
while I was strace'ing osd.4, and hasn't had any issues since then. I
restarted radosgw-agent, and it's running fine.

I don't think the snapshots are related, but I don't know. The snapshots I
deleted were taken over a 2 week period, and covered an increase of 40% of
the cluster data size.

The snapshot cron is still active, so I guess I'll repeat the experiment.
If the issue comes back in a couple weeks, I'll try the strace without
removing the snapshots.
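As a side note on Craig's strace approach: rather than launching the daemon
under strace, one can usually attach to an already-running (and hung) OSD to
see what it is doing at the syscall level. A hedged sketch, with a
placeholder pid lookup pattern:

    # find the pid of the hung OSD (pattern is illustrative; adjust for your init system)
    OSD_PID=$(pgrep -f 'ceph-osd.*-i 4( |$)')

    # attach with timestamps, following threads; Ctrl-C detaches and the OSD keeps running
    strace -tt -f -p "$OSD_PID" -o /tmp/ceph-osd.4.attach.strace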
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
> The osd.8 log shows it doing some deep scrubbing here. Perhaps that is
> what caused your earlier issues with CPU usage?

When I first noticed the CPU usage, I checked iotop and iostat. Both said
there was no disk activity, on any OSD.

>> At 14:17:25, I ran
>>     radosgw-admin --name=client.radosgw.ceph1c regions list
>>     radosgw-admin --name=client.radosgw.ceph1c regionmap get
>> regions list hung, and I killed it.
>> At 14:18:15, I stopped ceph-osd id=8.
>> At 14:18:45, I ran
>>     radosgw-admin --name=client.radosgw.ceph1c regions list
>>     radosgw-admin --name=client.radosgw.ceph1c regionmap get
>> It returned successfully.
>> At 14:19:10, I stopped ceph-osd id=4.
>
> Since you've got the noout flag set, when osd.8 goes down any objects
> for which osd.8 is the primary will not be readable. Since ceph reads
> from primaries, and the noout flag prevents another osd from being
> selected, which would happen if osd.8 were marked out, these objects
> (which apparently happen to include some needed for regions list or
> regionmap get) are inaccessible.
>
> Josh

Taking osd.8 down (regardless of the noout flag) was the only way to get
things to respond. I have not set nodown, just noout.

When I got in this morning, I had 4 more flapping OSDs: osd.4, osd.12,
osd.13, and osd.6. All 4 daemons were using 100% CPU, and no disk I/O.
osd.1 and osd.14 are the only ones currently using disk I/O.

There are 3 PGs being deep scrubbed:

root@ceph1c:/var/log/radosgw-agent# ceph pg dump | grep deep
dumped all in format plain

pg_stat: 11.77  objects: 48682  mip: 0  degr: 0  unf: 0
  bytes: 7614655060  log: 3001  disklog: 3001
  state: active+clean+scrubbing+deep  state_stamp: 2014-03-27 10:20:30.598032
  v: 8381'5180514  reported: 8521:6520833  up: [13,4]  acting: [13,4]
  last_scrub: 7894'5176984  scrub_stamp: 2014-03-20 04:41:48.762996
  last_deep_scrub: 7894'5176984  deep_scrub_stamp: 2014-03-20 04:41:48.762996

pg_stat: 11.69  objects: 88587  mip: 0  degr: 0  unf: 0
  bytes: 7723737171  log: 3001  disklog: 3001
  state: active+clean+scrubbing+deep  state_stamp: 2014-03-27 10:16:31.292487
  v: 8383'483312  reported: 8521:618864  up: [14,1]  acting: [14,1]
  last_scrub: 7894'479783  scrub_stamp: 2014-03-20 03:53:18.024015
  last_deep_scrub: 7894'479783  deep_scrub_stamp: 2014-03-20 03:53:18.024015

pg_stat: 11.d  objects: 88743  mip: 0  degr: 0  unf: 0
  bytes: 7570365909  log: 3409  disklog: 3409
  state: active+clean+scrubbing+deep  state_stamp: 2014-03-27 10:15:39.558121
  v: 8396'1753407  reported: 8521:2417672  up: [12,6]  acting: [12,6]
  last_scrub: 7894'1459230  scrub_stamp: 2014-03-20 02:40:22.123236
  last_deep_scrub: 7894'1459230  deep_scrub_stamp: 2014-03-20 02:40:22.123236

These PGs are on the 6 OSDs mentioned. osd.1 and osd.14 are not using 100%
CPU and are using disk I/O. osd.12, osd.6, osd.4, and osd.13 are using 100%
CPU, and 0 kB/s of disk I/O.

Here's iostat on ceph0c, which contains osd.1 (/dev/sdd), osd.4 (/dev/sde),
and osd.6 (/dev/sdg):

root@ceph0c:/var/log/ceph# iostat -p sdd,sde,sdh 1
Linux 3.5.0-46-generic (ceph0c)  03/27/2014  _x86_64_  (8 CPU)
[snip]

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          32.64   0.00     5.52     4.42    0.00  57.42

Device:  tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdd      113.00     900.00       0.00      900        0
sdd1     113.00     900.00       0.00      900        0
sde        0.00       0.00       0.00        0        0
sde1       0.00       0.00       0.00        0        0
sdh        0.00       0.00       0.00        0        0
sdh1       0.00       0.00       0.00        0        0

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          29.90   0.00     4.41     2.82    0.00  62.87

Device:  tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdd      181.00    1332.00       0.00     1332        0
sdd1     181.00    1332.00       0.00     1332        0
sde       22.00       8.00     328.00        8      328
sde1      18.00       8.00     328.00        8      328
sdh       18.00       4.00     228.00        4      228
sdh1      15.00       4.00     228.00        4      228

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          30.21   0.00     4.26     1.71    0.00  63.82

Device:  tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdd      180.00    1044.00     200.00     1044      200
sdd1     177.00    1044.00     200.00     1044      200
sde        0.00       0.00       0.00        0        0
sde1       0.00       0.00       0.00        0        0
sdh        0.00       0.00       0.00        0        0
sdh1       0.00       0.00       0.00        0        0

So it's not no disk activity, but it's pretty close. The disks continue to
have 0 kB_read and 0 kB_wrtn for the next 60 seconds. It's much lower than I
would expect for OSDs executing a deep scrub.

I restarted the 4 flapping OSDs. They recovered, then started flapping
within 5 minutes. I shut all of the ceph daemons down, and rebooted all
nodes at the same time. The OSDs return to 100% CPU usage very soon after
boot. I was
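For readers unfamiliar with the flag behaviour Josh describes above, the
relevant commands look roughly like this; the flag names are real, the
workflow is only illustrative.

    ceph osd set noout          # down OSDs stay "in": no re-mapping/backfill while you work on them
    ceph osd dump | grep flags  # confirm which cluster-wide flags are currently set
    ceph osd unset noout        # restore normal out-marking when maintenance is done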
Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
In the interest of removing variables, I removed all snapshots on all
pools, then restarted all ceph daemons at the same time. This brought up
osd.8 as well. The cluster started recovering. Now osd.4 and osd.13 are
doing this.

Any suggestions for how I can see what the hung OSDs are doing? The logs
don't look interesting. Is there a higher log level I can use?

I'm trying to use strace on osd.4:

    strace -tt -f -ff -o ./ceph-osd.4.strace -x /usr/bin/ceph-osd --cluster=ceph -i 4 -f

So far, strace is running, and the process isn't hung. After I ran this,
the cluster finally finished backfilling the last of the PGs (all on
osd.4). Since the cluster is healthy again, I killed the strace, and
started the daemon normally (start ceph-osd id=4). Things seem fine now.
I'm going to let it scrub and deepscrub overnight. I'll restart
radosgw-agent tomorrow.

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com

Central Desktop. Work together in ways you never thought possible.

On 3/27/14 10:44, Craig Lewis wrote:
>> The osd.8 log shows it doing some deep scrubbing here. Perhaps that is
>> what caused your earlier issues with CPU usage?
>
> When I first noticed the CPU usage, I checked iotop and iostat. Both
> said there was no disk activity, on any OSD.
>
>>> At 14:17:25, I ran
>>>     radosgw-admin --name=client.radosgw.ceph1c regions list
>>>     radosgw-admin --name=client.radosgw.ceph1c regionmap get
>>> regions list hung, and I killed it.
>>> At 14:18:15, I stopped ceph-osd id=8.
>>> At 14:18:45, I ran
>>>     radosgw-admin --name=client.radosgw.ceph1c regions list
>>>     radosgw-admin --name=client.radosgw.ceph1c regionmap get
>>> It returned successfully.
>>> At 14:19:10, I stopped ceph-osd id=4.
>>
>> Since you've got the noout flag set, when osd.8 goes down any objects
>> for which osd.8 is the primary will not be readable. Since ceph reads
>> from primaries, and the noout flag prevents another osd from being
>> selected, which would happen if osd.8 were marked out, these objects
>> (which apparently happen to include some needed for regions list or
>> regionmap get) are inaccessible.
>>
>> Josh
>
> Taking osd.8 down (regardless of the noout flag) was the only way to get
> things to respond. I have not set nodown, just noout.
>
> When I got in this morning, I had 4 more flapping OSDs: osd.4, osd.12,
> osd.13, and osd.6. All 4 daemons were using 100% CPU, and no disk I/O.
> osd.1 and osd.14 are the only ones currently using disk I/O.
>
> There are 3 PGs being deep scrubbed:
>
> root@ceph1c:/var/log/radosgw-agent# ceph pg dump | grep deep
> dumped all in format plain
>
> pg_stat: 11.77  objects: 48682  mip: 0  degr: 0  unf: 0
>   bytes: 7614655060  log: 3001  disklog: 3001
>   state: active+clean+scrubbing+deep  state_stamp: 2014-03-27 10:20:30.598032
>   v: 8381'5180514  reported: 8521:6520833  up: [13,4]  acting: [13,4]
>   last_scrub: 7894'5176984  scrub_stamp: 2014-03-20 04:41:48.762996
>   last_deep_scrub: 7894'5176984  deep_scrub_stamp: 2014-03-20 04:41:48.762996
>
> pg_stat: 11.69  objects: 88587  mip: 0  degr: 0  unf: 0
>   bytes: 7723737171  log: 3001  disklog: 3001
>   state: active+clean+scrubbing+deep  state_stamp: 2014-03-27 10:16:31.292487
>   v: 8383'483312  reported: 8521:618864  up: [14,1]  acting: [14,1]
>   last_scrub: 7894'479783  scrub_stamp: 2014-03-20 03:53:18.024015
>   last_deep_scrub: 7894'479783  deep_scrub_stamp: 2014-03-20 03:53:18.024015
>
> pg_stat: 11.d  objects: 88743  mip: 0  degr: 0  unf: 0
>   bytes: 7570365909  log: 3409  disklog: 3409
>   state: active+clean+scrubbing+deep  state_stamp: 2014-03-27 10:15:39.558121
>   v: 8396'1753407  reported: 8521:2417672  up: [12,6]  acting: [12,6]
>   last_scrub: 7894'1459230  scrub_stamp: 2014-03-20 02:40:22.123236
>   last_deep_scrub: 7894'1459230  deep_scrub_stamp: 2014-03-20 02:40:22.123236
>
> These PGs are on the 6 OSDs mentioned. osd.1 and osd.14 are not using
> 100% CPU and are using disk I/O. osd.12, osd.6, osd.4, and osd.13 are
> using 100% CPU, and 0 kB/s of disk I/O.
>
> Here's iostat on ceph0c, which contains osd.1 (/dev/sdd), osd.4
> (/dev/sde), and osd.6 (/dev/sdg):
>
> root@ceph0c:/var/log/ceph# iostat -p sdd,sde,sdh 1
> Linux 3.5.0-46-generic (ceph0c)  03/27/2014  _x86_64_  (8 CPU)
> [snip]
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           32.64   0.00     5.52     4.42    0.00  57.42
>
> Device:  tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
> sdd      113.00     900.00       0.00      900        0
> sdd1     113.00     900.00       0.00      900        0
> sde        0.00       0.00       0.00        0        0
> sde1       0.00       0.00       0.00        0        0
> sdh        0.00       0.00       0.00        0        0
> sdh1       0.00       0.00       0.00        0        0
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           29.90   0.00     4.41     2.82    0.00  62.87
>
> Device:  tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn