Re: [ceph-users] I have PGs that I can't deep-scrub

2014-07-10 Thread Craig Lewis
I fixed this issue by reformatting all of the OSDs.  I changed the mkfs
options from

[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

to
[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -s size=4096

(I have a mix of 512-byte and 4k sector drives, and I want to treat them all
as 4k sector drives.)
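
If you want to double-check what your drives actually report before settling
on -s, something like this works (a rough sketch; the device names are just
examples, adjust for your hosts):

  # logical vs. physical sector size, per drive
  blockdev --getss --getpbsz /dev/sdb
  cat /sys/block/sdb/queue/physical_block_size

  # roughly the mkfs call the trimmed [osd] options above translate to
  mkfs.xfs -f -s size=4096 /dev/sdb1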


Now deep scrub runs to completion, and CPU usage of the OSD daemons never
goes over 30%.  I did have to restart a few OSDs when I scrubbed known
problem PGs, but they scrubbed successfully the second time.  The cluster is
still scrubbing, but it has completed half with no further issues.



It took me a long time to correlate the "XFS: possible memory allocation
deadlock in kmem_alloc" messages in dmesg with OSD problems.  It was only
once I started having these deep-scrub issues that the XFS deadlock messages
became clearly correlated with OSD issues.

Looking back at previous issues I had with OSDs flapping, the XFS deadlocks
were present, but usually preceded the issues by several hours.


I strongly recommend that anybody who sees "XFS: possible memory allocation
deadlock in kmem_alloc" in dmesg reformat their XFS filesystems.  It's
painful, but my cluster has been rock solid since I finished.
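
A quick way to check each OSD host for the message (just a one-liner sketch):

  dmesg | grep -i "possible memory allocation deadlock"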





On Wed, Jun 11, 2014 at 2:23 PM, Craig Lewis cle...@centraldesktop.com
wrote:

 New logs, with debug ms = 1, debug osd = 20.


 In this timeline, I started the deep-scrub at 11:04:00.  Ceph started
 deep-scrubbing at 11:04:03.

 osd.11 started consuming 100% CPU around 11:07.  Same for osd.0.  CPU
 usage is all user; iowait is < 0.10%.  There is more variance in the
 CPU usage now, ranging between 98.5% and 101.2%.

 This time, I didn't see any major IO, read or write.

 osd.11 was marked down at 11:22:00:
 2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg
 stats for 902.656777seconds

 osd.0 was marked down at 11:36:00:
  2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg
 stats for 902.498894seconds




 ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
 ceph-osd.0.log.gz (140MiB, 18MiB compressed):
 https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
 ceph-osd.11.log.gz (131MiB, 17MiB compressed):
 https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
 ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc





 On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil s...@inktank.com wrote:
  Hi Craig,
 
  It's hard to say what is going wrong with that level of logs.  Can you
  reproduce with debug ms = 1 and debug osd = 20?
 
  There were a few things fixed in scrub between emperor and firefly.  Are
  you planning on upgrading soon?
 
  sage
 
 
  On Tue, 10 Jun 2014, Craig Lewis wrote:
 
  Every time I deep-scrub one PG, all of the OSDs responsible get kicked
  out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
  fails the same way every time.  OSD logs are linked at the bottom.
 
  What can I do to get this deep-scrub to complete cleanly?
 
  This is the first time I've deep-scrubbed these PGs since Sage helped
  me recover from some OSD problems
  (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html)
 
  I can trigger the issue easily in this cluster, but have not been able
  to re-create in other clusters.
 
 
 
 
 
 
  The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
  are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
  owned by OSDs [11,0]
 
  This is a secondary cluster, so I stopped all external I/O on it.  I
  set nodeep-scrub, and restarted both OSDs with:
debug osd = 5/5
debug filestore = 5/5
debug journal = 1
debug monc = 20/20
 
  then I ran a deep-scrub on this PG.
 
  2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
  active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
  77870 GB / 130 TB avail
  2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
  active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
  27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
 
 
  At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU
  (100.3% +/- 1.0%).  Prior to this, they were both using ~30% CPU.  It
  might've started a few seconds sooner, I'm watching top.
 
  I forgot to watch IO stat until 10:56.  At this point, both OSDs are
  reading.  iostat reports that they're both doing ~100
  transactions/sec, reading ~1 MiBps, 0 writes.
 
 
  At 11:01:26, iostat reports that both osds are no longer consuming any
  disk I/O.  They both go for > 30 seconds with 0 transactions, and 0
  kiB read/write.  There are small bumps of 2 transactions/sec for one
  second, then it's back to 0.
 
 
  At 11:02:41, the primary OSD gets kicked out by the monitors:
  2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555
  

Re: [ceph-users] I have PGs that I can't deep-scrub

2014-07-10 Thread Chris Dunlop
Hi Craig,

On Thu, Jul 10, 2014 at 03:09:51PM -0700, Craig Lewis wrote:
 I fixed this issue by reformatting all of the OSDs.  I changed the mkfs
 options from
 
 [osd]
   osd mkfs type = xfs
   osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
 
 to
 [osd]
   osd mkfs type = xfs
   osd mkfs options xfs = -s size=4096
 
 (I have a mix of 512-byte and 4k sector drives, and I want to treat them all
 as 4k sector drives.)
 
 
 Now deep scrub runs to completion, and CPU usage of the OSD daemons never
 goes over 30%.  I did have to restart a few OSDs when I scrubbed known
 problem PGs, but they scrubbed successfully the second time.  The cluster is
 still scrubbing, but it has completed half with no further issues.

I suspect it was the -n size=64k causing this behaviour, potentially using
too much CPU and starving the OSD processes:

http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option
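
If you want to confirm which OSD filesystems were formatted that way,
xfs_info on the mounted OSD data dir should show it (a sketch; the path
below is just the default osd data location, adjust to suit):

  xfs_info /var/lib/ceph/osd/ceph-0 | grep naming
  # e.g.  naming   =version 2   bsize=65536  ...   <- 64k directory blocks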

Cheers,

Chris


Re: [ceph-users] I have PGs that I can't deep-scrub

2014-06-11 Thread Sage Weil
Hi Craig,

It's hard to say what is going wrong with that level of logs.  Can you 
reproduce with debug ms = 1 and debug osd = 20?
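
Something like this should bump the levels on the two OSDs without a restart
(or put the equivalent in ceph.conf and restart them):

  ceph tell osd.0  injectargs '--debug-ms 1 --debug-osd 20'
  ceph tell osd.11 injectargs '--debug-ms 1 --debug-osd 20'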

There were a few things fixed in scrub between emperor and firefly.  Are 
you planning on upgrading soon?

sage


On Tue, 10 Jun 2014, Craig Lewis wrote:

 Every time I deep-scrub one PG, all of the OSDs responsible get kicked
 out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
 fails the same way every time.  OSD logs are linked at the bottom.
 
 What can I do to get this deep-scrub to complete cleanly?
 
 This is the first time I've deep-scrubbed these PGs since Sage helped
 me recover from some OSD problems
 (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html)
 
 I can trigger the issue easily in this cluster, but have not been able
 to re-create in other clusters.
 
 
 
 
 
 
 The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
 are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
 owned by OSDs [11,0]
 
 This is a secondary cluster, so I stopped all external I/O on it.  I
 set nodeep-scrub, and restarted both OSDs with:
   debug osd = 5/5
   debug filestore = 5/5
   debug journal = 1
   debug monc = 20/20
 
 then I ran a deep-scrub on this PG.
 
 2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
 active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
 77870 GB / 130 TB avail
 2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
 active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
 
 
 At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU
 (100.3% +/- 1.0%).  Prior to this, they were both using ~30% CPU.  It
 might've started a few seconds sooner, I'm watching top.
 
 I forgot to watch IO stat until 10:56.  At this point, both OSDs are
 reading.  iostat reports that they're both doing ~100
 transactions/sec, reading ~1 MiBps, 0 writes.
 
 
 At 11:01:26, iostat reports that both osds are no longer consuming any
 disk I/O.  They both go for > 30 seconds with 0 transactions, and 0
 kiB read/write.  There are small bumps of 2 transactions/sec for one
 second, then it's back to 0.
 
 
 At 11:02:41, the primary OSD gets kicked out by the monitors:
 2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555
 active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2
 op/s
 2014-06-10 11:02:57.801047 mon.0 [INF] osd.11 marked down after no pg
 stats for 903.825187seconds
 2014-06-10 11:02:57.823115 mon.0 [INF] osdmap e58834: 36 osds: 35 up, 36 in
 
 Both ceph-osd processes (11 and 0) continue to use 100% CPU (same range).
 
 
 At ~11:10, I see that osd.11 has resumed reading from disk at the
 original levels (~100 tps, ~1MiBps read, 0 MiBps write).  Since it's
 down, but doing something, I let it run.
 
 Both the osd.11 and osd.0 repeat this pattern.  Reading for a while at
 ~1 MiBps, then nothing.  The duty cycle seems about 50%, with a 20
 minute period, but I haven't timed anything.  CPU usage remains at
 100%, regardless of whether IO is happening or not.
 
 
 At 12:24:15, osd.11 rejoins the cluster:
 2014-06-10 12:24:15.294646 mon.0 [INF] osd.11 10.193.0.7:6804/7100 boot
 2014-06-10 12:24:15.294725 mon.0 [INF] osdmap e58838: 36 osds: 35 up, 36 in
 2014-06-10 12:24:15.343869 mon.0 [INF] pgmap v8832827: 2560 pgs: 1
 stale+active+clean+scrubbing+deep, 2266 active+clean, 5
 stale+active+clean, 287 active+degraded, 1 active+clean+scrubbing;
 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 15650 B/s rd,
 18 op/s; 3617854/61758142 objects degraded (5.858%)
 
 
 osd.0's CPU usage drops back to normal when osd.11 rejoins the
 cluster.  The PG stats have not changed.   The last_deep_scrub and
 deep_scrub_stamp are still 48009'1904117 2014-05-21 07:28:01.315996
 respectively.
 
 
 This time, osd.0 did not get kicked out by the monitors.  In previous
 attempts, osd.0 was kicked out 5-10 minutes after osd.11.  When that
 happens, osd.0 rejoins the cluster after osd.11.
 
 
 I have several more PGs exhibiting the same behavior.  At least 3 that
 I know of, and many more that I haven't attempted to deep-scrub.
 
 
 
 
 
 
 ceph -v: ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 ceph.conf: https://cd.centraldesktop.com/p/eAAADvxuAHJRUk4
 ceph-osd.11.log (5.7 MiB):
 https://cd.centraldesktop.com/p/eAAADvxyABPwaeM
 ceph-osd.0.log (6.3 MiB):
 https://cd.centraldesktop.com/p/eAAADvx0ADWEGng
 ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADvxvAAylTW0
 
 (the pg query was collected at 13:24, after the above events)
 
 
 
 
 Things that probably don't matter:
 The OSD partitions were created using ceph-disk-prepare --dmcrypt.

Re: [ceph-users] I have PGs that I can't deep-scrub

2014-06-11 Thread Craig Lewis
New logs, with debug ms = 1, debug osd = 20.


In this timeline, I started the deep-scrub at 11:04:00.  Ceph started
deep-scrubbing at 11:04:03.
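
(For reference, this is roughly how I'm kicking it off and capturing state;
same PG as before:)

  ceph pg deep-scrub 40.11e    # ask the primary (osd.11) to deep-scrub the PG
  ceph -w                      # watch for active+clean+scrubbing+deep in the pgmap
  ceph pg 40.11e query         # dump PG state afterwards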

osd.11 started consuming 100% CPU around 11:07.  Same for osd.0.  CPU
usage is all user; iowait is < 0.10%.  There is more variance in the
CPU usage now, ranging between 98.5% and 101.2%.

This time, I didn't see any major IO, read or write.

osd.11 was marked down at 11:22:00:
2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg
stats for 902.656777seconds

osd.0 was marked down at 11:36:00:
 2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg
stats for 902.498894seconds
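
(The ~900 second figure in those messages lines up with the monitors' pg
stats timeout; if I'm reading it right, that's the setting below at its
default.  Not suggesting changing it, just noting where the number comes
from:)

  [mon]
    mon osd report timeout = 900   # OSDs sending no pg stats for this long get marked down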




ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
ceph-osd.0.log.gz (140MiB, 18MiB compressed):
https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
ceph-osd.11.log.gz (131MiB, 17MiB compressed):
https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc





On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil s...@inktank.com wrote:
 Hi Craig,

 It's hard to say what is going wrong with that level of logs.  Can you
 reproduce with debug ms = 1 and debug osd = 20?

 There were a few things fixed in scrub between emperor and firefly.  Are
 you planning on upgrading soon?

 sage


 On Tue, 10 Jun 2014, Craig Lewis wrote:

 Every time I deep-scrub one PG, all of the OSDs responsible get kicked
 out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
 fails the same way every time.  OSD logs are linked at the bottom.

 What can I do to get this deep-scrub to complete cleanly?

 This is the first time I've deep-scrubbed these PGs since Sage helped
 me recover from some OSD problems
 (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html)

 I can trigger the issue easily in this cluster, but have not been able
 to re-create in other clusters.






 The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
 are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
 owned by OSDs [11,0]

 This is a secondary cluster, so I stopped all external I/O on it.  I
 set nodeep-scrub, and restarted both OSDs with:
   debug osd = 5/5
   debug filestore = 5/5
   debug journal = 1
   debug monc = 20/20

 then I ran a deep-scrub on this PG.

 2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
 active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
 77870 GB / 130 TB avail
 2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
 active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail


 At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU
 (100.3% +/- 1.0%).  Prior to this, they were both using ~30% CPU.  It
 might've started a few seconds sooner, I'm watching top.

 I forgot to watch IO stat until 10:56.  At this point, both OSDs are
 reading.  iostat reports that they're both doing ~100
 transactions/sec, reading ~1 MiBps, 0 writes.


 At 11:01:26, iostat reports that both osds are no longer consuming any
 disk I/O.  They both go for > 30 seconds with 0 transactions, and 0
 kiB read/write.  There are small bumps of 2 transactions/sec for one
 second, then it's back to 0.


 At 11:02:41, the primary OSD gets kicked out by the monitors:
 2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555
 active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2
 op/s
 2014-06-10 11:02:57.801047 mon.0 [INF] osd.11 marked down after no pg
 stats for 903.825187seconds
 2014-06-10 11:02:57.823115 mon.0 [INF] osdmap e58834: 36 osds: 35 up, 36 in

 Both ceph-osd processes (11 and 0) continue to use 100% CPU (same range).


 At ~11:10, I see that osd.11 has resumed reading from disk at the
 original levels (~100 tps, ~1MiBps read, 0 MiBps write).  Since it's
 down, but doing something, I let it run.

 Both the osd.11 and osd.0 repeat this pattern.  Reading for a while at
 ~1 MiBps, then nothing.  The duty cycle seems about 50%, with a 20
 minute period, but I haven't timed anything.  CPU usage remains at
 100%, regardless of whether IO is happening or not.


 At 12:24:15, osd.11 rejoins the cluster:
 2014-06-10 12:24:15.294646 mon.0 [INF] osd.11 10.193.0.7:6804/7100 boot
 2014-06-10 12:24:15.294725 mon.0 [INF] osdmap e58838: 36 osds: 35 up, 36 in
 2014-06-10 12:24:15.343869 mon.0 [INF] pgmap v8832827: 2560 pgs: 1
 stale+active+clean+scrubbing+deep, 2266 active+clean, 5
 stale+active+clean, 287 active+degraded, 1 active+clean+scrubbing;
 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 15650 B/s rd,
 18 op/s; 3617854/61758142 objects degraded (5.858%)


 osd.0's CPU usage drops back to normal when osd.11 rejoins the
 cluster.  The PG stats have not changed.   The last_deep_scrub and