Re: [ceph-users] I have PGs that I can't deep-scrub
I fixed this issue by reformatting all of the OSDs. I changed the mkfs options from

    [osd]
        osd mkfs type = xfs
        osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

to

    [osd]
        osd mkfs type = xfs
        osd mkfs options xfs = -s size=4096

(I have a mix of 512-byte and 4k-sector drives, and I want to treat them all as 4k-sector.)

Now deep scrub runs to completion, and CPU usage of the daemon never goes over 30%. I did have to restart a few OSDs when I scrubbed known problem PGs, but they scrubbed successfully the second time. The cluster is still scrubbing, but it has completed half with no further issues.

It took me a long time to correlate the "XFS: possible memory allocation deadlock in kmem_alloc" message in dmesg with OSD problems. It was only when I started having these deep-scrub issues that the XFS deadlock messages were well correlated with OSD issues. Looking back at previous issues I had with OSDs flapping, the XFS deadlocks were present, but usually preceded the problems by several hours.

I strongly recommend that anybody who sees "XFS: possible memory allocation deadlock in kmem_alloc" in dmesg reformat their XFS filesystems. It's painful, but my cluster has been rock solid since I finished.

On Wed, Jun 11, 2014 at 2:23 PM, Craig Lewis cle...@centraldesktop.com wrote:

New logs, with debug ms = 1, debug osd = 20. In this timeline, I started the deep-scrub at 11:04:00; Ceph started deep-scrubbing at 11:04:03.

osd.11 started consuming 100% CPU around 11:07; same for osd.0. CPU usage is all user time; iowait is 0.10%. There is more variance in the CPU usage now, ranging between 98.5% and 101.2%. This time, I didn't see any major IO, read or write.
osd.11 was marked down at 11:22:00:

    2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg stats for 902.656777seconds

osd.0 was marked down at 11:36:00:

    2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg stats for 902.498894seconds

ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
ceph-osd.0.log.gz (140MiB, 18MiB compressed): https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
ceph-osd.11.log.gz (131MiB, 17MiB compressed): https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc

On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil s...@inktank.com wrote:

Hi Craig,

It's hard to say what is going wrong with that level of logs. Can you reproduce with debug ms = 1 and debug osd = 20? There were a few things fixed in scrub between emperor and firefly. Are you planning on upgrading soon?

sage

On Tue, 10 Jun 2014, Craig Lewis wrote:

Every time I deep-scrub one PG, all of the OSDs responsible get kicked out of the cluster. I've deep-scrubbed this PG 4 times now, and it fails the same way every time. OSD logs are linked at the bottom. What can I do to get this deep-scrub to complete cleanly?

This is the first time I've deep-scrubbed these PGs since Sage helped me recover from some OSD problems (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html). I can trigger the issue easily in this cluster, but have not been able to re-create it in other clusters.

The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp are 48009'1904117 and 2014-05-21 07:28:01.315996 respectively. This PG is owned by OSDs [11,0].

This is a secondary cluster, so I stopped all external I/O on it. I set nodeep-scrub, and restarted both OSDs with:

    debug osd = 5/5
    debug filestore = 5/5
    debug journal = 1
    debug monc = 20/20

then I ran a deep-scrub on this PG.
    2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555 active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
    2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554 active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail

At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU (100.3% +/- 1.0%). Prior to this, they were both using ~30% CPU. It might've started a few seconds sooner; I'm watching top.

I forgot to watch iostat until 10:56. At this point, both OSDs are reading. iostat reports that they're both doing ~100 transactions/sec, reading ~1 MiBps, with 0 writes.

At 11:01:26, iostat reports that both OSDs are no longer consuming any disk I/O. They both go for 30 seconds with 0 transactions and 0 kiB read/write. There are small bumps of 2 transactions/sec for one second, then it's back to 0.

At 11:02:41, the primary OSD gets kicked out by the monitors:

    2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555 active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2 op/s
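Craig's hard-won correlation between the kernel message and OSD flapping can be checked mechanically. A minimal sketch: the grep pattern is the exact message quoted in this thread; the captured sample log is an illustrative stand-in, since on a live host you would pipe in `dmesg` (or `journalctl -k`) instead.

```shell
# Count "XFS: possible memory allocation deadlock" events in the kernel log.
# A captured sample is embedded here so the pipeline is self-contained;
# on a real host, replace the printf with:  dmesg
kernel_log='
[812345.1] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[812399.7] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
'
deadlocks=$(printf '%s' "$kernel_log" | grep -c 'possible memory allocation deadlock in kmem_alloc')
echo "xfs kmem_alloc deadlocks seen: $deadlocks"
```

A nonzero count on an OSD host is, per Craig's experience above, a leading indicator that the filesystem should be recreated before deep-scrubs start failing.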
Re: [ceph-users] I have PGs that I can't deep-scrub
Hi Craig,

On Thu, Jul 10, 2014 at 03:09:51PM -0700, Craig Lewis wrote:

I fixed this issue by reformatting all of the OSDs. I changed the mkfs options from

    [osd]
        osd mkfs type = xfs
        osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

to

    [osd]
        osd mkfs type = xfs
        osd mkfs options xfs = -s size=4096

(I have a mix of 512-byte and 4k-sector drives, and I want to treat them all as 4k-sector.) Now deep scrub runs to completion, and CPU usage of the daemon never goes over 30%. I did have to restart a few OSDs when I scrubbed known problem PGs, but they scrubbed successfully the second time. The cluster is still scrubbing, but it has completed half with no further issues.

I suspect it was the -n size=64k causing this behaviour, potentially using too much CPU and starving the OSD processes:
http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option

Cheers,
Chris

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
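For anyone wondering whether their existing OSD filesystems carry the 64k directory block size the FAQ entry warns about, xfs_info reports it on the "naming" line. A sketch of extracting it; the here-string stands in for live xfs_info output so the parsing is self-contained, and the mount point in the comment is a hypothetical example.

```shell
# On a live OSD host you would run, e.g.:  xfs_info /var/lib/ceph/osd/ceph-11
# The "naming" line's bsize field is the directory block size:
# 65536 corresponds to mkfs.xfs -n size=64k; 4096 is the mkfs.xfs default.
sample_naming_line='naming   =version 2              bsize=65536  ascii-ci=0'
dir_bsize=$(printf '%s\n' "$sample_naming_line" |
    sed -n 's/.*naming.*bsize=\([0-9]*\).*/\1/p')
echo "directory block size: $dir_bsize"
```

Filesystems reporting 65536 would be the candidates for the reformat Craig describes; the setting cannot be changed in place.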
Re: [ceph-users] I have PGs that I can't deep-scrub
Hi Craig,

It's hard to say what is going wrong with that level of logs. Can you reproduce with debug ms = 1 and debug osd = 20? There were a few things fixed in scrub between emperor and firefly. Are you planning on upgrading soon?

sage

On Tue, 10 Jun 2014, Craig Lewis wrote:

Every time I deep-scrub one PG, all of the OSDs responsible get kicked out of the cluster. I've deep-scrubbed this PG 4 times now, and it fails the same way every time. OSD logs are linked at the bottom. What can I do to get this deep-scrub to complete cleanly?

This is the first time I've deep-scrubbed these PGs since Sage helped me recover from some OSD problems (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html). I can trigger the issue easily in this cluster, but have not been able to re-create it in other clusters.

The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp are 48009'1904117 and 2014-05-21 07:28:01.315996 respectively. This PG is owned by OSDs [11,0].

This is a secondary cluster, so I stopped all external I/O on it. I set nodeep-scrub, and restarted both OSDs with:

    debug osd = 5/5
    debug filestore = 5/5
    debug journal = 1
    debug monc = 20/20

then I ran a deep-scrub on this PG.

    2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555 active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
    2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554 active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail

At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU (100.3% +/- 1.0%). Prior to this, they were both using ~30% CPU. It might've started a few seconds sooner; I'm watching top.

I forgot to watch iostat until 10:56. At this point, both OSDs are reading. iostat reports that they're both doing ~100 transactions/sec, reading ~1 MiBps, with 0 writes.
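The logging levels Sage asks for can be expressed as a ceph.conf fragment; a sketch, assuming the daemons are restarted afterwards (on a running OSD the same levels can alternatively be injected with `ceph tell osd.N injectargs`):

```ini
[osd]
    ; verbose messenger and OSD logging for reproducing the scrub hang;
    ; expect large log files (the ones in this thread ran to ~140 MiB)
    debug ms = 1
    debug osd = 20
```

Remember to lower the levels again once the reproduction is captured, since debug osd = 20 is far too chatty for normal operation.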
At 11:01:26, iostat reports that both OSDs are no longer consuming any disk I/O. They both go for 30 seconds with 0 transactions and 0 kiB read/write. There are small bumps of 2 transactions/sec for one second, then it's back to 0.

At 11:02:41, the primary OSD gets kicked out by the monitors:

    2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555 active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2 op/s
    2014-06-10 11:02:57.801047 mon.0 [INF] osd.11 marked down after no pg stats for 903.825187seconds
    2014-06-10 11:02:57.823115 mon.0 [INF] osdmap e58834: 36 osds: 35 up, 36 in

Both ceph-osd processes (11 and 0) continue to use 100% CPU (same range). At ~11:10, I see that osd.11 has resumed reading from disk at the original levels (~100 tps, ~1 MiBps read, 0 MiBps write). Since it's down, but doing something, I let it run.

Both osd.11 and osd.0 repeat this pattern: reading for a while at ~1 MiBps, then nothing. The duty cycle seems to be about 50%, with a 20-minute period, but I haven't timed anything. CPU usage remains at 100%, regardless of whether IO is happening or not.

At 12:24:15, osd.11 rejoins the cluster:

    2014-06-10 12:24:15.294646 mon.0 [INF] osd.11 10.193.0.7:6804/7100 boot
    2014-06-10 12:24:15.294725 mon.0 [INF] osdmap e58838: 36 osds: 35 up, 36 in
    2014-06-10 12:24:15.343869 mon.0 [INF] pgmap v8832827: 2560 pgs: 1 stale+active+clean+scrubbing+deep, 2266 active+clean, 5 stale+active+clean, 287 active+degraded, 1 active+clean+scrubbing; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 15650 B/s rd, 18 op/s; 3617854/61758142 objects degraded (5.858%)

osd.0's CPU usage drops back to normal when osd.11 rejoins the cluster. The PG stats have not changed; last_deep_scrub and deep_scrub_stamp are still 48009'1904117 and 2014-05-21 07:28:01.315996 respectively. This time, osd.0 did not get kicked out by the monitors.
In previous attempts, osd.0 was kicked out 5-10 minutes after osd.11. When that happens, osd.0 rejoins the cluster after osd.11 does.

I have several more PGs exhibiting the same behavior: at least 3 that I know of, and many more that I haven't attempted to deep-scrub.

ceph -v: ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
ceph.conf: https://cd.centraldesktop.com/p/eAAADvxuAHJRUk4
ceph-osd.11.log (5.7 MiB): https://cd.centraldesktop.com/p/eAAADvxyABPwaeM
ceph-osd.0.log (6.3 MiB): https://cd.centraldesktop.com/p/eAAADvx0ADWEGng
ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADvxvAAylTW0
(the pg query was collected at 13:24, after the above events)

Things that probably don't matter: the OSD partitions were created using ceph-disk-prepare --dmcrypt.
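The recurring "marked down after no pg stats for ~900 seconds" messages reflect the monitor's grace period for OSD stat reports. A sketch of the option involved, assuming the mon osd report timeout name and its 900-second default from Ceph of this era; note that raising it would only delay the down-marking, not fix the stuck scrub:

```ini
[mon]
    ; default 900s; the log lines in this thread show OSDs being marked
    ; down just past this threshold while deep-scrub pins them at 100% CPU
    mon osd report timeout = 900
```

The ~903- and ~902-second figures quoted in the mon logs above are consistent with this default.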
Re: [ceph-users] I have PGs that I can't deep-scrub
New logs, with debug ms = 1, debug osd = 20. In this timeline, I started the deep-scrub at 11:04:00; Ceph started deep-scrubbing at 11:04:03.

osd.11 started consuming 100% CPU around 11:07; same for osd.0. CPU usage is all user time; iowait is 0.10%. There is more variance in the CPU usage now, ranging between 98.5% and 101.2%. This time, I didn't see any major IO, read or write.

osd.11 was marked down at 11:22:00:

    2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg stats for 902.656777seconds

osd.0 was marked down at 11:36:00:

    2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg stats for 902.498894seconds

ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
ceph-osd.0.log.gz (140MiB, 18MiB compressed): https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
ceph-osd.11.log.gz (131MiB, 17MiB compressed): https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc

On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil s...@inktank.com wrote:

Hi Craig,

It's hard to say what is going wrong with that level of logs. Can you reproduce with debug ms = 1 and debug osd = 20? There were a few things fixed in scrub between emperor and firefly. Are you planning on upgrading soon?

sage

On Tue, 10 Jun 2014, Craig Lewis wrote:

Every time I deep-scrub one PG, all of the OSDs responsible get kicked out of the cluster. I've deep-scrubbed this PG 4 times now, and it fails the same way every time. OSD logs are linked at the bottom. What can I do to get this deep-scrub to complete cleanly?

This is the first time I've deep-scrubbed these PGs since Sage helped me recover from some OSD problems (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html). I can trigger the issue easily in this cluster, but have not been able to re-create it in other clusters.
The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp are 48009'1904117 and 2014-05-21 07:28:01.315996 respectively. This PG is owned by OSDs [11,0].

This is a secondary cluster, so I stopped all external I/O on it. I set nodeep-scrub, and restarted both OSDs with:

    debug osd = 5/5
    debug filestore = 5/5
    debug journal = 1
    debug monc = 20/20

then I ran a deep-scrub on this PG.

    2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555 active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
    2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554 active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail

At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU (100.3% +/- 1.0%). Prior to this, they were both using ~30% CPU. It might've started a few seconds sooner; I'm watching top.

I forgot to watch iostat until 10:56. At this point, both OSDs are reading. iostat reports that they're both doing ~100 transactions/sec, reading ~1 MiBps, with 0 writes.

At 11:01:26, iostat reports that both OSDs are no longer consuming any disk I/O. They both go for 30 seconds with 0 transactions and 0 kiB read/write. There are small bumps of 2 transactions/sec for one second, then it's back to 0.

At 11:02:41, the primary OSD gets kicked out by the monitors:

    2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555 active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2 op/s
    2014-06-10 11:02:57.801047 mon.0 [INF] osd.11 marked down after no pg stats for 903.825187seconds
    2014-06-10 11:02:57.823115 mon.0 [INF] osdmap e58834: 36 osds: 35 up, 36 in

Both ceph-osd processes (11 and 0) continue to use 100% CPU (same range). At ~11:10, I see that osd.11 has resumed reading from disk at the original levels (~100 tps, ~1 MiBps read, 0 MiBps write).
Since it's down, but doing something, I let it run. Both osd.11 and osd.0 repeat this pattern: reading for a while at ~1 MiBps, then nothing. The duty cycle seems to be about 50%, with a 20-minute period, but I haven't timed anything. CPU usage remains at 100%, regardless of whether IO is happening or not.

At 12:24:15, osd.11 rejoins the cluster:

    2014-06-10 12:24:15.294646 mon.0 [INF] osd.11 10.193.0.7:6804/7100 boot
    2014-06-10 12:24:15.294725 mon.0 [INF] osdmap e58838: 36 osds: 35 up, 36 in
    2014-06-10 12:24:15.343869 mon.0 [INF] pgmap v8832827: 2560 pgs: 1 stale+active+clean+scrubbing+deep, 2266 active+clean, 5 stale+active+clean, 287 active+degraded, 1 active+clean+scrubbing; 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 15650 B/s rd, 18 op/s; 3617854/61758142 objects degraded (5.858%)

osd.0's CPU usage drops back to normal when osd.11 rejoins the cluster. The PG stats have not changed; the last_deep_scrub and deep_scrub_stamp are still 48009'1904117 and 2014-05-21 07:28:01.315996 respectively.