Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-27 Thread Mikaël Cluseau

Hi,

On 03/18/2015 03:01 PM, Gregory Farnum wrote:

I think it tended to crash rather than
hang like this so I'm a bit surprised, but if this op is touching a
broken file or something that could explain it.


FWIW, the last time I had the issue (on a 3.10.9 kernel), btrfs was 
freezing, waiting forever, and even spawning threads that raised the host's 
load to numbers like 400. I can't say much more for now as I had to move 
to XFS. I will set up a lab someday to try btrfs again, as I heard it has 
been more stable since 3.16.1, but I think I won't enable compression. 
BTW, it's really a filesystem I would like to use in production someday 
(as many would, I think ;)).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-26 Thread Chris Murray
That's fair enough, Greg. I'll keep upgrading when the opportunity arises, and 
maybe it'll spring back to life someday :-)

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: 20 March 2015 23:05
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

On Fri, Mar 20, 2015 at 4:03 PM, Chris Murray chrismurra...@gmail.com wrote:
 Ah, I was wondering myself if compression could be causing an issue, but I'm 
 reconsidering now. My latest experiment should hopefully help troubleshoot.

 So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I 
 try that:

 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) 
 -exec btrfs filesystem defragment -v -czlib -- {} +

 After much, much waiting, all files have been rewritten, but the OSD still 
 gets stuck at the same point.

 I've now unset the compress attribute on all files and started the defragment 
 process again, but I'm not too hopeful since the files must be 
 readable/writeable if I didn't get some failure during the defrag process.

 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec chattr -c -- {} +
 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -- {} +

 (latter command still running)

 Any other ideas at all? In the absence of the problem being spelled out to me 
 with an error of some sort, I'm not sure how to troubleshoot further.

Not much, sorry.

 Is it safe to upgrade a problematic cluster, when the time comes, in case 
 this ultimately is a CEPH bug which is fixed in something later than 0.80.9?

In general it should be fine since we're careful about backwards compatibility, 
but without knowing the actual issue I can't promise anything.
-Greg



Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-20 Thread Chris Murray
Ah, I was wondering myself if compression could be causing an issue, but I'm 
reconsidering now. My latest experiment should hopefully help troubleshoot.

So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I try 
that:

find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
btrfs filesystem defragment -v -czlib -- {} +

After much, much waiting, all files have been rewritten, but the OSD still gets 
stuck at the same point.

I've now unset the compress attribute on all files and started the defragment 
process again, but I'm not too hopeful since the files must be 
readable/writeable if I didn't get some failure during the defrag process.

find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
chattr -c -- {} +
find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
btrfs filesystem defragment -v -- {} +

(latter command still running)

Any other ideas at all? In the absence of the problem being spelled out to me 
with an error of some sort, I'm not sure how to troubleshoot further.
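
The only other low-level checks I can think of are along these lines (just a 
sketch against the mount point above; I haven't confirmed they turn anything 
up here):

dmesg | grep -iE 'btrfs|sdb'
btrfs device stats /var/lib/ceph/osd/ceph-1
btrfs filesystem df /var/lib/ceph/osd/ceph-1
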
Is it safe to upgrade a problematic cluster, when the time comes, in case this 
ultimately is a CEPH bug which is fixed in something later than 0.80.9?

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: 18 March 2015 14:01
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray chrismurra...@gmail.com wrote:
 Hi again Greg :-)

 No, it doesn't seem to progress past that point. I started the OSD again a 
 couple of nights ago:

 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 
 13288339 open_ops 1 - 0, max_applied_seq 13288338 - 13288339
 2015-03-16 21:34:46.221445 7fe4a8aa7780  3 journal journal_replay: r = 
 0, op_seq now 13288339
 2015-03-16 21:34:46.221513 7fe4a8aa7780  2 journal read_entry 
 3951706112 : seq 13288340 1755 bytes
 2015-03-16 21:34:46.221547 7fe4a8aa7780  3 journal journal_replay: 
 applying op seq 13288340
 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 
 13288340 open_ops 0 - 1
 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 
 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
 meta/16ef7597/infos/head//-1
 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 
 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore   header.spos 
 0.0.0
 2015-03-16 21:34:48.096002 7fe49a5ac700 20 
 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start 
 max_applied_seq 13288339, open_ops 1
 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start 
 waiting for 1 open ops to drain

 There's the success line for 13288339, like you mentioned. But not one for 
 13288340.

 Intriguing. So, those same 1755 bytes seem problematic every time the journal 
 is replayed? Interestingly, there is a lot (in time, not exactly data mass or 
 IOPs, but still more than 1755 bytes!) of activity while the log is at this 
 line:

 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start 
 waiting for 1 open ops to drain

 ... but then the IO ceases and the log still doesn't go any further. I wonder 
 why 13288339 doesn't have that same  'waiting for ... open ops to drain' 
 line. Or the 'woke after' one for that matter.

 While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:

 sdb  20.00 0.00  3404.00  0   3404
 sdb  16.00 0.00  2100.00  0   2100
 sdb  10.00 0.00  1148.00  0   1148
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   1.00 0.00   496.00  0496
 sdb  32.00 0.00  4940.00  0   4940
 sdb   8.00 0.00  1144.00  0   1144
 sdb   1.00 0.00 4.00  0  4
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-20 Thread Gregory Farnum
On Fri, Mar 20, 2015 at 4:03 PM, Chris Murray chrismurra...@gmail.com wrote:
 Ah, I was wondering myself if compression could be causing an issue, but I'm 
 reconsidering now. My latest experiment should hopefully help troubleshoot.

 So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I 
 try that:

 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
 btrfs filesystem defragment -v -czlib -- {} +

 After much, much waiting, all files have been rewritten, but the OSD still 
 gets stuck at the same point.

 I've now unset the compress attribute on all files and started the defragment 
 process again, but I'm not too hopeful since the files must be 
 readable/writeable if I didn't get some failure during the defrag process.

 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
 chattr -c -- {} +
 find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec 
 btrfs filesystem defragment -v -- {} +

 (latter command still running)

 Any other ideas at all? In the absence of the problem being spelled out to me 
 with an error of some sort, I'm not sure how to troubleshoot further.

Not much, sorry.

 Is it safe to upgrade a problematic cluster, when the time comes, in case 
 this ultimately is a CEPH bug which is fixed in something later than 0.80.9?

In general it should be fine since we're careful about backwards
compatibility, but without knowing the actual issue I can't promise
anything.
-Greg


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-18 Thread Chris Murray
 further. 

On another note, I'm not getting my own e-mails back to myself from the list 
anymore. In case I should be troubleshooting this side, is/isn't this happening 
to others?

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: 16 March 2015 20:40
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

On Sat, Mar 14, 2015 at 1:56 AM, Chris Murray chrismurra...@gmail.com wrote:
 Good evening all,

 Just had another quick look at this with some further logging on and thought 
 I'd post the results in case anyone can keep me moving in the right direction.

 Long story short, some OSDs just don't appear to come up after one failing 
 after another. Dealing with one in isolation, after a load of IO, it never 
 starts. The last in the log looks like this:

 ...
 2015-03-13 18:43:11.875392 7f29d1e98780 10 filestore   header.spos 
 0.0.0
 2015-03-13 18:43:11.876568 7f29d1e98780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeyrange 
 meta/39e3fb/pglog_4.57c/0//-1 
 [00.,4294967295.18446744073709551615]
 2015-03-13 18:43:11.876598 7f29d1e98780 15 
 filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator 
 meta/39e3fb/pglog_4.57c/0//-1
 2015-03-13 18:43:11.952511 7f29d1e98780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys 
 meta/39e3fb/pglog_4.57c/0//-1
 2015-03-13 18:43:11.952878 7f29d1e98780 10 filestore oid: 
 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
 2015-03-13 18:43:11.952892 7f29d1e98780 10 filestore   header.spos 
 0.0.0
 2015-03-13 18:43:11.961127 7f29d1e98780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys 
 meta/39e3fb/pglog_4.57c/0//-1
 2015-03-13 18:43:11.961516 7f29d1e98780 10 filestore oid: 
 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
 2015-03-13 18:43:11.961529 7f29d1e98780 10 filestore   header.spos 
 0.0.0
 2015-03-13 18:43:11.965687 7f29d1e98780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
 meta/39e3fb/pglog_4.57c/0//-1
 2015-03-13 18:43:11.966082 7f29d1e98780 10 filestore oid: 
 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
 2015-03-13 18:43:11.966095 7f29d1e98780 10 filestore   header.spos 
 0.0.0
 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 
 13288339 open_ops 1 - 0, max_applied_seq 13288338 - 13288339
 2015-03-13 18:43:11.989861 7f29d1e98780  3 journal journal_replay: r = 
 0, op_seq now 13288339
 2015-03-13 18:43:11.989896 7f29d1e98780  2 journal read_entry 
 3951706112 : seq 13288340 1755 bytes
 2015-03-13 18:43:11.989900 7f29d1e98780  3 journal journal_replay: 
 applying op seq 13288340
 2015-03-13 18:43:11.989903 7f29d1e98780 10 journal op_apply_start 
 13288340 open_ops 0 - 1
 2015-03-13 18:43:11.989906 7f29d1e98780 10 
 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x2750480
 2015-03-13 18:43:11.989919 7f29d1e98780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
 meta/16ef7597/infos/head//-1
 2015-03-13 18:43:11.990251 7f29d1e98780 10 filestore oid: 
 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
 2015-03-13 18:43:11.990263 7f29d1e98780 10 filestore   header.spos 
 0.0.0
 2015-03-13 18:43:15.404558 7f29c4439700 20 
 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000217
 2015-03-13 18:43:15.404600 7f29c4439700 10 journal commit_start 
 max_applied_seq 13288339, open_ops 1
 2015-03-13 18:43:15.404603 7f29c4439700 10 journal commit_start 
 waiting for 1 open ops to drain

 What might this 'open op' mean when it never seems to finish 'draining'? 
 Could my suspicions be true that it's somehow a BTRFS funny?

Well, this line:

 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 
 13288339 open_ops 1 - 0, max_applied_seq 13288338 - 13288339

is the system recognizing that an op has finished. Is that last one not 
followed by a similar line?

In any case, yes, btrfs misbehaving stupendously badly could cause things to 
hang, if eg it's just not finishing a write that the OSD is trying to put in. 
Although I'd naively expect some dmesg output if that were the case.
-Greg


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-18 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray chrismurra...@gmail.com wrote:
 Hi again Greg :-)

 No, it doesn't seem to progress past that point. I started the OSD again a 
 couple of nights ago:

 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 13288339 
 open_ops 1 - 0, max_applied_seq 13288338 - 13288339
 2015-03-16 21:34:46.221445 7fe4a8aa7780  3 journal journal_replay: r = 0, 
 op_seq now 13288339
 2015-03-16 21:34:46.221513 7fe4a8aa7780  2 journal read_entry 3951706112 : 
 seq 13288340 1755 bytes
 2015-03-16 21:34:46.221547 7fe4a8aa7780  3 journal journal_replay: applying 
 op seq 13288340
 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 13288340 
 open_ops 0 - 1
 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 
 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 
 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 
 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore   header.spos 0.0.0
 2015-03-16 21:34:48.096002 7fe49a5ac700 20 
 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start 
 max_applied_seq 13288339, open_ops 1
 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 
 open ops to drain

 There's the success line for 13288339, like you mentioned. But not one for 
 13288340.

 Intriguing. So, those same 1755 bytes seem problematic every time the journal 
 is replayed? Interestingly, there is a lot (in time, not exactly data mass or 
 IOPs, but still more than 1755 bytes!) of activity while the log is at this 
 line:

 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 
 open ops to drain

 ... but then the IO ceases and the log still doesn't go any further. I wonder 
 why 13288339 doesn't have that same  'waiting for ... open ops to drain' 
 line. Or the 'woke after' one for that matter.

 While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:

 sdb  20.00 0.00  3404.00  0   3404
 sdb  16.00 0.00  2100.00  0   2100
 sdb  10.00 0.00  1148.00  0   1148
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   1.00 0.00   496.00  0496
 sdb  32.00 0.00  4940.00  0   4940
 sdb   8.00 0.00  1144.00  0   1144
 sdb   1.00 0.00 4.00  0  4
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb  17.00 0.00  3340.00  0   3340
 sdb  23.00 0.00  3368.00  0   3368
 sdb   1.00 0.00 4.00  0  4
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb   0.00 0.00 0.00  0  0
 sdb  13.00 0.00  3332.00  0   3332
 sdb  18.00 0.00  2360.00  0   2360
 sdb  59.00 0.00  7464.00  0   7464
 sdb   0.00 0.00 0.00  0  0

 I was hoping Google may have held some clues, but it seems I'm the only one 
 :-)

 https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting+for%22+%22open+ops+to+drain%22

 I tried removing compress-force=lzo from osd mount options btrfs in 
 ceph.conf, in case it was the 

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-14 Thread Chris Murray
Good evening all,

Just had another quick look at this with some further logging on and thought 
I'd post the results in case anyone can keep me moving in the right direction. 

Long story short, some OSDs just don't appear to come up after one failing 
after another. Dealing with one in isolation, after a load of IO, it never 
starts. The last in the log looks like this:

...
2015-03-13 18:43:11.875392 7f29d1e98780 10 filestore   header.spos 0.0.0
2015-03-13 18:43:11.876568 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_rmkeyrange meta/39e3fb/pglog_4.57c/0//-1 
[00.,4294967295.18446744073709551615]
2015-03-13 18:43:11.876598 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) 
get_omap_iterator meta/39e3fb/pglog_4.57c/0//-1
2015-03-13 18:43:11.952511 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-13 18:43:11.952878 7f29d1e98780 10 filestore oid: 
39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
2015-03-13 18:43:11.952892 7f29d1e98780 10 filestore   header.spos 0.0.0
2015-03-13 18:43:11.961127 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-13 18:43:11.961516 7f29d1e98780 10 filestore oid: 
39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
2015-03-13 18:43:11.961529 7f29d1e98780 10 filestore   header.spos 0.0.0
2015-03-13 18:43:11.965687 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_setkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-13 18:43:11.966082 7f29d1e98780 10 filestore oid: 
39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
2015-03-13 18:43:11.966095 7f29d1e98780 10 filestore   header.spos 0.0.0
2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 
open_ops 1 - 0, max_applied_seq 13288338 - 13288339
2015-03-13 18:43:11.989861 7f29d1e98780  3 journal journal_replay: r = 0, 
op_seq now 13288339
2015-03-13 18:43:11.989896 7f29d1e98780  2 journal read_entry 3951706112 : seq 
13288340 1755 bytes
2015-03-13 18:43:11.989900 7f29d1e98780  3 journal journal_replay: applying op 
seq 13288340
2015-03-13 18:43:11.989903 7f29d1e98780 10 journal op_apply_start 13288340 
open_ops 0 - 1
2015-03-13 18:43:11.989906 7f29d1e98780 10 filestore(/var/lib/ceph/osd/ceph-1) 
_do_transaction on 0x2750480
2015-03-13 18:43:11.989919 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_setkeys meta/16ef7597/infos/head//-1
2015-03-13 18:43:11.990251 7f29d1e98780 10 filestore oid: 
16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
2015-03-13 18:43:11.990263 7f29d1e98780 10 filestore   header.spos 0.0.0
2015-03-13 18:43:15.404558 7f29c4439700 20 filestore(/var/lib/ceph/osd/ceph-1) 
sync_entry woke after 5.000217
2015-03-13 18:43:15.404600 7f29c4439700 10 journal commit_start max_applied_seq 
13288339, open_ops 1
2015-03-13 18:43:15.404603 7f29c4439700 10 journal commit_start waiting for 1 
open ops to drain

What might this 'open op' mean when it never seems to finish 'draining'? Could 
my suspicions be true that it's somehow a BTRFS funny?

Thanks again,
Chris

-Original Message-
From: Chris Murray 
Sent: 03 March 2015 09:45
To: Gregory Farnum
Cc: ceph-users
Subject: RE: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

Ah yes, that's a good point :-)

Thank you for your assistance Greg, I'm understanding a little more about how 
Ceph operates under the hood now.

We're probably at a reasonable point for me to say I'll just switch the 
machines off and forget about them for a while. It's no great loss; I just 
wanted to see if the cluster would come back to life despite any mis-treatment, 
and how far it can be pushed with the limited resources on the Microservers.

Getting to the admin socket fails:

root@ceph26:~# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok help
admin_socket: exception getting command descriptions: [Errno 111] Connection 
refused

And after activity ceased on /dev/sdb ...

(60 second intervals again, snipped many hours of these sorts of figures)
sdb   5.52 0.00   801.27  0  48076
sdb   4.68 0.00   731.80  0  43908
sdb   5.25 0.00   792.80  0  47568
sdb  18.83   483.07   569.53  28984  34172
sdb  28.28   894.6035.40  53676   2124
sdb   0.00 0.00 0.00  0  0
sdb   0.00 0.00 0.00  0  0
sdb   0.00 0.00 0.00  0  0
sdb   0.00 0.00 0.00  0  0

... the log hadn't progressed beyond the below. Note the last entry was 13 
hours prior to activity on sdb ending, so whatever finished writing (then 
momentarily reading) this morning, didn't add anything to the log.

...
2015-03-02 18:24:45.942970

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-03 Thread Chris Murray
Ah yes, that's a good point :-)

Thank you for your assistance Greg, I'm understanding a little more about how 
Ceph operates under the hood now.

We're probably at a reasonable point for me to say I'll just switch the 
machines off and forget about them for a while. It's no great loss; I just 
wanted to see if the cluster would come back to life despite any mis-treatment, 
and how far it can be pushed with the limited resources on the Microservers.

Getting to the admin socket fails:

root@ceph26:~# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok help
admin_socket: exception getting command descriptions: [Errno 111] Connection 
refused

And after activity ceased on /dev/sdb ...

(60 second intervals again, snipped many hours of these sorts of figures)
sdb   5.52 0.00   801.27  0  48076
sdb   4.68 0.00   731.80  0  43908
sdb   5.25 0.00   792.80  0  47568
sdb  18.83   483.07   569.53  28984  34172
sdb  28.28   894.6035.40  53676   2124
sdb   0.00 0.00 0.00  0  0
sdb   0.00 0.00 0.00  0  0
sdb   0.00 0.00 0.00  0  0
sdb   0.00 0.00 0.00  0  0

... the log hadn't progressed beyond the below. Note the last entry was 13 
hours prior to activity on sdb ending, so whatever finished writing (then 
momentarily reading) this morning, didn't add anything to the log.

...
2015-03-02 18:24:45.942970 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) 
get_omap_iterator meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.977857 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.978400 7f27f03ef780 10 filestore oid: 
39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
2015-03-02 18:24:45.978414 7f27f03ef780 10 filestore   header.spos 0.0.0
2015-03-02 18:24:45.986763 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.987350 7f27f03ef780 10 filestore oid: 
39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
2015-03-02 18:24:45.987363 7f27f03ef780 10 filestore   header.spos 0.0.0
2015-03-02 18:24:45.991651 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_setkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.992119 7f27f03ef780 10 filestore oid: 
39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
2015-03-02 18:24:45.992128 7f27f03ef780 10 filestore   header.spos 0.0.0
2015-03-02 18:24:46.016116 7f27f03ef780 10 filestore(/var/lib/ceph/osd/ceph-1) 
_do_transaction on 0x1a92540
2015-03-02 18:24:46.016133 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) 
_omap_setkeys meta/16ef7597/infos/head//-1
2015-03-02 18:24:46.016542 7f27f03ef780 10 filestore oid: 
16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
2015-03-02 18:24:46.016555 7f27f03ef780 10 filestore   header.spos 0.0.0
2015-03-02 18:24:48.855098 7f27e2fe0700 20 filestore(/var/lib/ceph/osd/ceph-1) 
sync_entry woke after 5.000291

The complete file is attached, in case it's of interest to anyone.

I get the feeling it's BTRFS which is the 'cause' here. I'm running a scrub in 
case it highlights anything wrong with the filesystem. If it all springs back 
to life, I'll post back here with my findings!
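
In case it's useful to anyone, the scrub is just the stock btrfs commands 
against the OSD mount point, roughly:

btrfs scrub start /var/lib/ceph/osd/ceph-1
btrfs scrub status /var/lib/ceph/osd/ceph-1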

Thanks again for the pointers,
Chris

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: 02 March 2015 18:05
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

You can turn the filestore up to 20 instead of 1. ;) You might also explore 
what information you can get out of the admin socket.

You are correct that those numbers are the OSD epochs, although note that when 
the system is running you'll get output both for the OSD as a whole and for 
individual PGs within it (which can be lagging behind). I'm still pretty 
convinced the OSDs are simply stuck trying to bring their PGs up to date and 
are thrashing the maps on disk, but we're well past what I can personally 
diagnose without log diving.
-Greg

On Sat, Feb 28, 2015 at 11:51 AM, Chris Murray chrismurra...@gmail.com wrote:
 After noticing that the number increases by 101 on each attempt to 
 start osd.11, I figured I was only 7 iterations away from the output 
 being within 101 of 63675. So, I killed the osd process, started it 
 again, lather, rinse, repeat. I then did the same for other OSDs. Some 
 created very small logs, and some created logs into the gigabytes. 
 Grepping the latter for update_osd_stat showed me where the maps 
 were up to, and therefore which OSDs needed some special attention. 
 Some of the epoch numbers appeared to increase by themselves to a 
 point

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-02 Thread Gregory Farnum
You can turn the filestore up to 20 instead of 1. ;) You might also
explore what information you can get out of the admin socket.

You are correct that those numbers are the OSD epochs, although note
that when the system is running you'll get output both for the OSD as
a whole and for individual PGs within it (which can be lagging
behind). I'm still pretty convinced the OSDs are simply stuck trying
to bring their PGs up to date and are thrashing the maps on disk, but
we're well past what I can personally diagnose without log diving.
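
Roughly, for the filestore debug and admin socket bit, something like the 
below (only a sketch; assumes osd.1 and the default socket path, adjust for 
your daemons):

ceph tell osd.1 injectargs '--debug-filestore 20 --debug-journal 20'
ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep debug
ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight

injectargs only works while the daemon is registered with the cluster, so for 
OSDs that never come up the ceph.conf route plus a restart is probably the 
practical one.
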
-Greg

On Sat, Feb 28, 2015 at 11:51 AM, Chris Murray chrismurra...@gmail.com wrote:
 After noticing that the number increases by 101 on each attempt to start
 osd.11, I figured I was only 7 iterations away from the output being
 within 101 of 63675. So, I killed the osd process, started it again,
 lather, rinse, repeat. I then did the same for other OSDs. Some created
 very small logs, and some created logs into the gigabytes. Grepping the
 latter for update_osd_stat showed me where the maps were up to, and
 therefore which OSDs needed some special attention. Some of the epoch
 numbers appeared to increase by themselves to a point and then plateaux,
 after which I'd kill then start the osd again, and this number would
 start to increase again.

 After all either showed 63675, or nothing at all, I turned debugging
 back off, deleted logs, and tried to bring the cluster back by unsetting
 noup, nobackfill, norecovery etc. It hasn't got very far before
 appearing stuck again, with nothing progressing in ceph status. It
 appears that 11/15 OSDs are now properly up, but four still aren't. A
 lot of placement groups are stale, so I guess I really need the
 remaining four to come up.

 The OSDs in question are 1, 7, 10 & 12. All have a line similar to this
 as the last in their log:

 2015-02-28 10:35:04.240822 7f375ef40780  1 journal _open
 /var/lib/ceph/osd/ceph-1/journal fd 21: 5367660544 bytes, block size
 4096 bytes, directio = 1, aio = 1

 Even with the following in ceph.conf, I'm not seeing anything after that
 last line in the log.

  debug osd = 20
  debug filestore = 1

 CPU is still being consumed by the ceph-osd process though, but not much
 memory is being used compared to the other two OSDs which are up on that
 node.

 Is there perhaps even further logging that I can use to see why the logs
 aren't progressing past this point?
 Osd.1 is on /dev/sdb. iostat still shows some activity as the minutes go
 on, but not much:

 (60 second intervals)
 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
 sdb   5.45 0.00   807.33  0  48440
 sdb   5.75 0.00   807.33  0  48440
 sdb   5.43 0.00   807.20  0  48440

 Thanks,
 Chris

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Chris Murray
 Sent: 27 February 2015 10:32
 To: Gregory Farnum
 Cc: ceph-users
 Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will
 the cluster recover without help?

 A little further logging:

 2015-02-27 10:27:15.745585 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:15.745619 7fe8e3f2f700  5 osd.11 62839 heartbeat:
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:23.530913 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f26380 con 0xe1f0cc60
 2015-02-27 10:27:30.645902 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:30.645938 7fe8e3f2f700  5 osd.11 62839 heartbeat:
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:33.531142 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f26540 con 0xe1f0cc60
 2015-02-27 10:27:43.531333 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f26700 con 0xe1f0cc60
 2015-02-27 10:27:45.546275 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:45.546311 7fe8e3f2f700  5 osd.11 62839 heartbeat:
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:53.531564 7fe8e8536700  1 -- 192.168.12.25:6800/673078
 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
 v2 -- ?+0 0xe5f268c0 con 0xe1f0cc60
 2015-02-27 10:27:56.846593 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
 osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
 [])
 2015-02-27 10:27:56.846627 7fe8e3f2f700  5 osd.11 62839

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-28 Thread Chris Murray
After noticing that the number increases by 101 on each attempt to start
osd.11, I figured I was only 7 iterations away from the output being
within 101 of 63675. So, I killed the osd process, started it again,
lather, rinse, repeat. I then did the same for other OSDs. Some created
very small logs, and some created logs into the gigabytes. Grepping the
latter for update_osd_stat showed me where the maps were up to, and
therefore which OSDs needed some special attention. Some of the epoch
numbers appeared to increase by themselves to a point and then plateaux,
after which I'd kill then start the osd again, and this number would
start to increase again.
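
(The grep itself was nothing fancy; per OSD it was something along these 
lines, using osd.11 and the default log location as the example:)

grep update_osd_stat /var/log/ceph/ceph-osd.11.log | tail -n 1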

After all either showed 63675, or nothing at all, I turned debugging
back off, deleted logs, and tried to bring the cluster back by unsetting
noup, nobackfill, norecovery etc. It hasn't got very far before
appearing stuck again, with nothing progressing in ceph status. It
appears that 11/15 OSDs are now properly up, but four still aren't. A
lot of placement groups are stale, so I guess I really need the
remaining four to come up.

The OSDs in question are 1, 7, 10 & 12. All have a line similar to this
as the last in their log:

2015-02-28 10:35:04.240822 7f375ef40780  1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 21: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1

Even with the following in ceph.conf, I'm not seeing anything after that
last line in the log.

 debug osd = 20
 debug filestore = 1
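
If I end up needing more, my guess is to push the relevant subsystems further 
in the same place, along these lines (a guess on my part as to which 
subsystems matter here, not a known-good recipe):

 debug osd = 20
 debug filestore = 20
 debug journal = 20
 debug ms = 1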

CPU is still being consumed by the ceph-osd process though, but not much
memory is being used compared to the other two OSDs which are up on that
node.

Is there perhaps even further logging that I can use to see why the logs
aren't progressing past this point? 
Osd.1 is on /dev/sdb. iostat still shows some activity as the minutes go
on, but not much:

(60 second intervals)
Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdb   5.45 0.00   807.33  0  48440
sdb   5.75 0.00   807.33  0  48440
sdb   5.43 0.00   807.20  0  48440
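
(Those figures are from plain iostat, along the lines of the below; 60-second 
samples of just that device:)

iostat -d -k sdb 60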

Thanks,
Chris

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Chris Murray
Sent: 27 February 2015 10:32
To: Gregory Farnum
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will
the cluster recover without help?

A little further logging:

2015-02-27 10:27:15.745585 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:15.745619 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:23.530913 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26380 con 0xe1f0cc60
2015-02-27 10:27:30.645902 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:30.645938 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:33.531142 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26540 con 0xe1f0cc60
2015-02-27 10:27:43.531333 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26700 con 0xe1f0cc60
2015-02-27 10:27:45.546275 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:45.546311 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:53.531564 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f268c0 con 0xe1f0cc60
2015-02-27 10:27:56.846593 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:56.846627 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:57.346965 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:57.347001 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:28:03.531785 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26a80 con 0xe1f0cc60
2015-02-27 10:28:13.532027 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-27 Thread Chris Murray
That's interesting, it seems to be alternating between two lines, but only one 
thread this time? I'm guessing the 62738 is the osdmap, which is much behind 
where it should be? Osd.0 and osd.3 are on 63675, if I'm understanding that 
correctly.

2015-02-27 08:18:48.724645 7f2fbd1e8700 20 osd.11 62738 update_osd_stat 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:18:48.724683 7f2fbd1e8700  5 osd.11 62738 heartbeat: 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:00.025003 7f2fbd1e8700 20 osd.11 62738 update_osd_stat 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:00.025040 7f2fbd1e8700  5 osd.11 62738 heartbeat: 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:04.125395 7f2fbd1e8700 20 osd.11 62738 update_osd_stat 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:04.125431 7f2fbd1e8700  5 osd.11 62738 heartbeat: 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:26.225763 7f2fbd1e8700 20 osd.11 62738 update_osd_stat 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:26.225797 7f2fbd1e8700  5 osd.11 62738 heartbeat: 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:26.726140 7f2fbd1e8700 20 osd.11 62738 update_osd_stat 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])
2015-02-27 08:19:26.726177 7f2fbd1e8700  5 osd.11 62738 heartbeat: 
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist [])

Activity on /dev/sdb looks similar to how it did without debugging:

sdb   5.95 0.00   701.20  0  42072
sdb   5.10 0.00   625.60  0  37536
sdb   4.97 0.00   611.33  0  36680
sdb   5.77 0.00   701.20  0  42072


Some Googling reveals references to log files which have very similar entries, 
but I can't see anything that just repeats like mine does.

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: 26 February 2015 22:37
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the 
cluster recover without help?

If you turn up debug osd = 20 or something it'll apply a good bit more disk 
load but give you more debugging logs about what's going on.
It could be that you're in enough of a mess now that it's stuck trying to 
calculate past intervals for a bunch of PGs across so many maps that it's 
swapping things in and out of memory and going slower (if that's the case, then 
increasing the size of your map cache will help).
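
(The map cache is just a ceph.conf knob, e.g. something like "osd map cache 
size = 1024" under [osd]; the default is a few hundred, and that value is only 
an illustration, not a tested recommendation.)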
-Greg

On Thu, Feb 26, 2015 at 1:56 PM, Chris Murray chrismurra...@gmail.com wrote:
 Tackling this on a more piecemeal basis, I've stopped all OSDs, and 
 started just the three which exist on the first host.

 osd.0 comes up without complaint: osd.0 63675 done with init, 
 starting boot process
 osd.3 comes up without complaint: osd.3 63675 done with init, 
 starting boot process
 osd.11 is a problematic one.

 It does something like this ...

 2015-02-26 10:44:50.260593 7f7e23551780  0 ceph version 0.80.8 
 (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 
 305080
 2015-02-26 10:44:50.265525 7f7e23551780  0
 filestore(/var/lib/ceph/osd/ceph-11) mount detected btrfs
 2015-02-26 10:44:51.155501 7f7e23551780  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
 FIEMAP ioctl is supported and appears to work
 2015-02-26 10:44:51.155536 7f7e23551780  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
 FIEMAP ioctl is disabled via 'filestore fiemap' config option
 2015-02-26 10:44:51.433239 7f7e23551780  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
 syscall(SYS_syncfs, fd) fully supported
 2015-02-26 10:44:51.433467 7f7e23551780  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
 CLONE_RANGE ioctl is supported
 2015-02-26 10:44:51.644373 7f7e23551780  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
 SNAP_CREATE is supported
 2015-02-26 10:44:51.668424 7f7e23551780  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
 SNAP_DESTROY is supported
 2015-02-26 10:44:51.668741 7f7e23551780  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
 START_SYNC is supported (transid 43205)
 2015-02-26 10:44:51.766577 7f7e23551780  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
 WAIT_SYNC is supported
 2015-02-26 10:44:51.814761 7f7e23551780  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
 SNAP_CREATE_V2 is supported
 2015-02-26 10:44:52.181382 7f7e23551780  0
 filestore(/var/lib/ceph/osd/ceph-11) mount

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-27 Thread Chris Murray
A little further logging:

2015-02-27 10:27:15.745585 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:15.745619 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:23.530913 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26380 con 0xe1f0cc60
2015-02-27 10:27:30.645902 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:30.645938 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:33.531142 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26540 con 0xe1f0cc60
2015-02-27 10:27:43.531333 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26700 con 0xe1f0cc60
2015-02-27 10:27:45.546275 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:45.546311 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:53.531564 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f268c0 con 0xe1f0cc60
2015-02-27 10:27:56.846593 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:56.846627 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:57.346965 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:27:57.347001 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:28:03.531785 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26a80 con 0xe1f0cc60
2015-02-27 10:28:13.532027 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26c40 con 0xe1f0cc60
2015-02-27 10:28:23.047382 7fe8e3f2f700 20 osd.11 62839 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:28:23.047419 7fe8e3f2f700  5 osd.11 62839 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 10:28:23.532271 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26e00 con 0xe1f0cc60
2015-02-27 10:28:33.532496 7fe8e8536700  1 -- 192.168.12.25:6800/673078
-- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0})
v2 -- ?+0 0xe5f26fc0 con 0xe1f0cc60

62839? But it was 62738 earlier, so it is actually advancing toward the
63675? If what I've assumed about the osd map numbers is true.
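
(A quick way to confirm the cluster's idea of the latest epoch, assuming the 
mons answer, would be something like the below; it prints "osdmap eNNNNN: ..." 
to compare the OSD log numbers against:)

ceph osd stat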

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Chris Murray
Sent: 27 February 2015 08:33
To: Gregory Farnum
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will
the cluster recover without help?

That's interesting, it seems to be alternating between two lines, but
only one thread this time? I'm guessing the 62738 is the osdmap, which
is much behind where it should be? Osd.0 and osd.3 are on 63675, if I'm
understanding that correctly.

2015-02-27 08:18:48.724645 7f2fbd1e8700 20 osd.11 62738 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:18:48.724683 7f2fbd1e8700  5 osd.11 62738 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:19:00.025003 7f2fbd1e8700 20 osd.11 62738 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:19:00.025040 7f2fbd1e8700  5 osd.11 62738 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:19:04.125395 7f2fbd1e8700 20 osd.11 62738 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:19:04.125431 7f2fbd1e8700  5 osd.11 62738 heartbeat:
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:19:26.225763 7f2fbd1e8700 20 osd.11 62738 update_osd_stat
osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist
[])
2015-02-27 08:19:26.225797 7f2fbd1e8700  5 osd.11 62738 heartbeat:
osd_stat(1305 GB used

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-26 Thread Chris Murray
 10:46:26.316307 7f7e23551780 -1 osd.11 61425
set_disk_tp_priority(22) Invalid argument: osd_disk_thread_ioprio_class
is  but only the following values are allowed: idle, be or rt
2015-02-26 10:46:26.320354 7f7df1017700  0 osd.11 61425 ignoring osdmap
until we have initialized
2015-02-26 10:46:26.320431 7f7df1017700  0 osd.11 61425 ignoring osdmap
until we have initialized
2015-02-26 10:46:26.334867 7f7df544e700  0 -- 0.0.0.0:6801/305080 
192.168.12.27:6801/3135 pipe(0x2c9e500 sd=39 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d1e0).fault with nothing to send, going to standby
2015-02-26 10:46:26.334985 7f7df3378700  0 -- 0.0.0.0:6801/305080 
192.168.12.28:6811/3824 pipe(0x2c9ec80 sd=41 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d4a0).fault with nothing to send, going to standby
2015-02-26 10:46:26.335074 7f7df9a90700  0 -- 0.0.0.0:6801/305080 
192.168.12.26:6806/3574 pipe(0x2c9f400 sd=43 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d340).fault with nothing to send, going to standby
2015-02-26 10:46:26.335152 7f7df9b91700  0 -- 0.0.0.0:6801/305080 
192.168.12.26:6801/3188 pipe(0x2c9f180 sd=44 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d080).fault with nothing to send, going to standby
2015-02-26 10:46:26.341356 7f7df534d700  0 -- 0.0.0.0:6801/305080 
192.168.12.25:6801/3318 pipe(0x2c9ef00 sd=40 :52612 s=1 pgs=0 cs=0 l=0
c=0xe248d600).connect claims to be 0.0.0.0:6801/305080 not
192.168.12.25:6801/3318 - wrong node!
2015-02-26 10:46:26.341522 7f7df534d700  0 -- 0.0.0.0:6801/305080 
192.168.12.25:6801/3318 pipe(0x2c9ef00 sd=40 :52612 s=1 pgs=0 cs=0 l=0
c=0xe248d600).fault with nothing to send, going to standby
2015-02-26 10:46:26.467208 7f7df2233700  0 -- 0.0.0.0:6801/305080 
192.168.12.27:6806/3582 pipe(0xe3566000 sd=148 :0 s=1 pgs=0 cs=0 l=0
c=0xe355ac60).fault with nothing to send, going to standby
2015-02-26 10:46:26.467400 7f7df1ef4700  0 -- 0.0.0.0:6801/305080 
192.168.12.29:6801/3169 pipe(0xe3566500 sd=149 :0 s=1 pgs=0 cs=0 l=0
c=0xe355ab00).fault with nothing to send, going to standby
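
(As an aside, the set_disk_tp_priority complaint near the top just looks like 
a missing value for the ioprio class; I'd guess something like the below in 
ceph.conf would quieten it, though I haven't tried, and it's presumably 
unrelated to the hang:)

 osd disk thread ioprio class = idle
 osd disk thread ioprio priority = 7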

At this point, it's writing to /dev/sdb at up to about 100 IOPs, which
would be sensible for the drive in question: a WD Red 3TB. Over 60
second intervals, iostat looks a bit like this:

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdb   5.17 0.00   660.40  0  39624
sdb   5.40 0.00   688.27  0  41296
sdb   4.62 0.00   564.47  0  33868
sdb   4.95 0.00   655.13  0  39308
sdb   5.33 0.00   688.87  0  41332
sdb   5.05 0.00   663.47  0  39808

Every now and then there'll be a few kb of reads, but the rest of the
time it's just writing. /dev/sdc and /dev/sdd don't appear to be doing
that, despite also being formatted with BTRFS.

Then, after whatever I've set filestore op thread timeout to, these
messages start. I've tried 600, 3600, and even a whopping 32400 (9
hours, to fit troubleshooting into what's left of my day here in the UK)
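
(For reference, the knob I've been changing is the one below, in the [osd] 
section, values in seconds; there's also a suicide variant, though whether 
it's relevant here is a guess on my part:)

 filestore op thread timeout = 32400
 filestore op thread suicide timeout = 36000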

2015-02-26 21:39:03.215672 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
2015-02-26 21:39:03.215699 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
2015-02-26 21:39:08.215855 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
2015-02-26 21:39:08.215883 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
2015-02-26 21:39:13.216038 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
2015-02-26 21:39:13.216066 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400

Note the subtle alternation between two thread numbers. 

Even after the start of these timeouts, /dev/sdb is still having bursts
of write activity, which is interesting.

sdb   5.27 0.00   683.60  0  41016
sdb   5.20 0.00   660.40  0  39624
sdb   5.17 0.00   639.67  0  38380
sdb   5.00 0.53   617.33 32  37040

Can anyone comment on what might be causing this error for this osd?
Many years ago, when ZFS was in its infancy, I had a dedup disaster
which I thought would never end, but that just needed to do its thing
before the pool came back to life. Could this be a similar scenario
perhaps? Is the activity leading up to something, and BTRFS is slowly
doing what Ceph is asking of it, or is it just going round and round in
circles and I just can't see? :-)

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Chris Murray
Sent: 25 February 2015 12:58
To: Gregory Farnum
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down

Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-26 Thread Gregory Farnum
 comment on what might be causing this error for this osd?
 Many years ago, when ZFS was in its infancy, I had a dedup disaster
 which I thought would never end, but that just needed to do its thing
 before the pool came back to life. Could this be a similar scenario
 perhaps? Is the activity leading up to something, and BTRFS is slowly
 doing what Ceph is asking of it, or is it just going round and round in
 circles and I just can't see? :-)

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Chris Murray
 Sent: 25 February 2015 12:58
 To: Gregory Farnum
 Cc: ceph-users
 Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will
 the cluster recover without help?

 Thanks Greg

 After seeing some recommendations I found in another thread, my
 impatience got the better of me, and I've start the process again, but
 there is some logic, I promise :-) I've copied the process from Michael
 Kidd, I believe, and it goes along the lines of:

 setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
 stopping all OSDs
 setting all OSDs down & out
 setting various options in ceph.conf to limit backfill activity etc
 starting all OSDs
 wait until all CPU settles to 0%  -- I am here
 unset the noup flag
 wait until all CPU settles to 0%
 unset the noin flag
 wait until all CPU settles to 0%
 unset the nobackfill flag
 wait until all CPU settles to 0%
 unset the norecover flag
 remove options from ceph.conf
 unset the noscrub flag
 unset the nodeep-scrub flag


 Currently, host CPU usage is approx the following, so something's
 changed, and I'm tempted to leave things a little longer before my next
 step, just in case CPU does eventually stop spinning. I read reports of
 things taking a while even with modern Xeons, so I suppose it's not
 outside the realms of possibility that an AMD Neo might take days to
 work things out. We're up to 24.5 hours now:

 192.168.12.25   20%
 192.168.12.26   1%
 192.168.12.27   15%
 192.168.12.28   1%
 192.168.12.29   12%

 Interesting, as 192.168.12.26 and .28 are the two which stopped spinning
 before I restarted this process too.

 The number of different states is slightly less confusing now, but not
 by much: :-)

 788386/2591752 objects degraded (30.419%)
   90 stale+active+clean
2 stale+down+remapped+peering
2 stale+incomplete
1
 stale+active+degraded+remapped+wait_backfill+backfill_toofull
1 stale+degraded
 1255 stale+active+degraded
   32 stale+remapped+peering
  773 stale+active+remapped
4 stale+active+degraded+remapped+backfill_toofull
 1254 stale+down+peering
  278 stale+peering
   33 stale+active+remapped+backfill_toofull
  563 stale+active+degraded+remapped

 Well, you below indicate that osd.14's log says it crashed on an
 internal heartbeat timeout (usually, it got stuck waiting for disk IO or
 the kernel/btrfs hung), so that's why. The osd.12 process exists but
 isn't up; osd.14 doesn't even have a socket to connect to.

 Ah, that does make sense, thank you.

 That's not what I'd expect to see (it appears to have timed out and
 not be recognizing it?) but I don't look at these things too often so
 maybe that's the normal indication that heartbeats are failing.

 I'm not sure what this means either. A google for "heartbeat_map is_healthy
 'FileStore::op_tp thread' had timed out after" doesn't return much, but I did
 see this quote from Sage on what looks like a similar matter:

 - the filestore op_queue is blocked on the throttler (too much io
 queued)
 - the commit thread is also waiting for ops to finish
 - i see no actual thread processing the op_queue Usually that's
 because it hit a kernel bug and got killed.  Not sure what else would
 make that thread disappear...
 sage

 Oh!

 Also, you want to find out why they're dying. That's probably the root

 cause of your issues

 I have a sneaking suspicion it's BTRFS, but don't have the evidence or
 perhaps the knowledge to prove it. If XFS did compression, I'd go with
 that, but at the moment I need to rely on compression to solve the
 problem of reclaiming space *within* files which reside on ceph. As far
 as I remember, overwriting with zeros didn't re-do the thin provisioning
 on XFS, if that makes sense.

 Thanks again,
 Chris


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-25 Thread Chris Murray
Thanks Greg

After seeing some recommendations I found in another thread, my impatience got 
the better of me, and I've start the process again, but there is some logic, I 
promise :-)
I've copied the process from Michael Kidd, I believe, and it goes along the 
lines of:

setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
stopping all OSDs
setting all OSDs down & out
setting various options in ceph.conf to limit backfill activity etc
starting all OSDs
wait until all CPU settles to 0%  -- I am here
unset the noup flag
wait until all CPU settles to 0%
unset the noin flag
wait until all CPU settles to 0%
unset the nobackfill flag
wait until all CPU settles to 0%
unset the norecover flag
remove options from ceph.conf
unset the noscrub flag
unset the nodeep-scrub flag
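
As commands, that sketch looks roughly like this (assuming the sysvinit ceph
init script, cluster-wide flags, and an OSD id range of 0-14 purely for
illustration):

ceph osd set noup
ceph osd set noin
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd set norecover
ceph osd set nobackfill
/etc/init.d/ceph stop osd                  # on every host
for i in $(seq 0 14); do ceph osd down $i; ceph osd out $i; done
# ... add the backfill/recovery limits to ceph.conf here, then ...
/etc/init.d/ceph start osd                 # on every host
# later, one flag at a time, waiting for CPU to settle in between:
ceph osd unset noup
ceph osd unset noin
ceph osd unset nobackfill
ceph osd unset norecover
# finally, remove the ceph.conf options and:
ceph osd unset noscrub
ceph osd unset nodeep-scrub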


Currently, host CPU usage is approximately as follows, so something's changed, and 
I'm tempted to leave things a little longer before my next step, just in case 
CPU does eventually stop spinning. I read reports of things taking a while 
even with modern Xeons, so I suppose it's not outside the realms of possibility 
that an AMD Neo might take days to work things out. We're up to 24.5 hours now:

192.168.12.25   20%
192.168.12.26   1%
192.168.12.27   15%
192.168.12.28   1%
192.168.12.29   12%

Interesting, as 192.168.12.26 and .28 are the two which stopped spinning before 
I restarted this process too.

The number of different states is slightly less confusing now, but not by much: 
:-)

788386/2591752 objects degraded (30.419%)
  90 stale+active+clean
   2 stale+down+remapped+peering
   2 stale+incomplete
   1 
stale+active+degraded+remapped+wait_backfill+backfill_toofull
   1 stale+degraded
1255 stale+active+degraded
  32 stale+remapped+peering
 773 stale+active+remapped
   4 stale+active+degraded+remapped+backfill_toofull
1254 stale+down+peering
 278 stale+peering
  33 stale+active+remapped+backfill_toofull
 563 stale+active+degraded+remapped

 Well, you below indicate that osd.14's log says it crashed on an internal 
 heartbeat timeout (usually, it got stuck waiting for disk IO or the 
 kernel/btrfs hung), so that's why. The osd.12 process exists but isn't up; 
 osd.14 doesn't even have a socket to connect to.

Ah, that does make sense, thank you.

 That's not what I'd expect to see (it appears to have timed out and not be 
 recognizing it?) but I don't look at these things too often so maybe that's 
 the normal indication that heartbeats are failing.

I'm not sure what this means either. A google for "heartbeat_map is_healthy 
FileStore::op_tp thread had timed out after" doesn't return much, but I did see 
this quote from Sage on what looks like a similar matter:

 - the filestore op_queue is blocked on the throttler (too much io queued)
 - the commit thread is also waiting for ops to finish
 - i see no actual thread processing the op_queue
 Usually that's because it hit a kernel bug and got killed.  Not sure what 
 else would make that thread disappear...
 sage

Oh!
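
If it is the same failure mode, the OSD log and the kernel log should show it.
A quick check along these lines (paths assume the default log locations;
osd.14 is the one mentioned above):

grep "heartbeat_map is_healthy" /var/log/ceph/ceph-osd.14.log | tail
dmesg | grep -i "blocked for more than"    # kernel hung-task warnings
dmesg | grep -i btrfs | tail               # any btrfs warnings or oopses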

 Also, you want to find out why they're dying. That's probably the root cause 
 of your issues

I have a sneaking suspicion it's BTRFS, but don't have the evidence or perhaps 
the knowledge to prove it. If XFS did compression, I'd go with that, but at the 
moment I need to rely on compression to solve the problem of reclaiming space 
*within* files which reside on ceph. As far as I remember, overwriting with 
zeros didn't re-do the thin provisioning on XFS, if that makes sense.
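
As a rough illustration of what I mean (the path is hypothetical; any btrfs
mount with compress-force=lzo behaves this way), a file full of zeros ends up
using far less space on disk than its apparent size, which is why zeroing
free space inside an image effectively thins it back out:

dd if=/dev/zero of=/mnt/btrfs-test/zeros bs=1M count=256 conv=fsync
ls -lh /mnt/btrfs-test/zeros    # apparent size: 256M
du -h /mnt/btrfs-test/zeros     # allocated size: far smaller with lzo
rm /mnt/btrfs-test/zeros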

Thanks again,
Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-24 Thread Gregory Farnum
On Mon, Feb 23, 2015 at 8:59 AM, Chris Murray chrismurra...@gmail.com wrote:
 ... Trying to send again after reporting bounce backs to dreamhost ...
 ... Trying to send one more time after seeing mails come through the
 list today ...

 Hi all,

 First off, I should point out that this is a 'small cluster' issue and
 may well be due to the stretched resources. If I'm doomed to destroying
 and starting again, so be it, but I'm interested to see if things can
 get up and running again.

 My experimental ceph cluster now has 5 nodes with 3 osds each. Some
 drives are big, some drives are small. Most are formatted with BTRFS and
 two are still formatted with XFS, which I intend to remove and recreate
 with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but
 compression suits my use-case, so I'm prepared to stick with it while it
 matures. I had to set the following, to avoid osds dying as the IO was
 consumed by the snapshot creation and deletion process (as I understand
 it):

 filestore btrfs snap = false

 and the mount options look like this:

 osd mount options btrfs =
 rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo

 Each node is a HP Microserver n36l or n54l, with 8GB of memory, so CPU
 horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is
 also a mon.

 My issue is: After adding the 15th osd, the cluster went into a spiral
 of destruction, with osds going down one after another. One might go
 down on occasion, and usually starting the osd in question will remedy
 things. This time, though, it hasn't, and the problem appears to have
 become worse and worse. I've tried starting osds, restarting whole
 hosts, to no avail. I've brought all osds back 'in' and set noup, nodown
 and noout. I've ceased rbd activity since it was getting blocked anyway.
 The cluster appears to now be 'stuck' in this state:

 cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
  health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969
 pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445
 pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck
 unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full
 osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
  monmap e5: 5 mons at
 {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=
 192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618,
 quorum 0,1,2,3,4 0,1,2,3,4
  osdmap e63276: 15 osds: 7 up, 15 in
 flags noup,nodown,noout
   pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
 4611 GB used, 871 GB / 5563 GB avail
 838948/2578420 objects degraded (32.537%)
3 down+remapped+peering
8 stale+active+degraded+remapped
   85 active+clean
1 stale+incomplete
 1088 stale+down+peering
  642 active+degraded+remapped
1 incomplete
   33 stale+remapped+peering
  135 down+peering
1 stale+degraded
1
 stale+active+degraded+remapped+wait_backfill+backfill_toofull
  854 active+remapped
  234 stale+active+degraded
4 active+degraded+remapped+backfill_toofull
   40 active+remapped+backfill_toofull
 1079 active+degraded
5 stale+active+clean
   74 stale+peering

 Take one of the nodes. It holds osds 12 (down & in), 13 (up & in) and 14
 (down & in).

 # ceph osd stat
  osdmap e63276: 15 osds: 7 up, 15 in
 flags noup,nodown,noout

 # ceph daemon osd.12 status
 no valid command found; 10 closest matches:
 config show
 help
 log dump
 get_command_descriptions
 git_version
 config set var val [val...]
 version
 2
 config get var
 0
 admin_socket: invalid command

 # ceph daemon osd.13 status
 { "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
   "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
   "whoami": 13,
   "state": "active",
   "oldest_map": 48214,
   "newest_map": 63276,
   "num_pgs": 790}

 # ceph daemon osd.14 status
 admin_socket: exception getting command descriptions: [Errno 111]
 Connection refused

 I'm assuming osds 12 and 14 are acting that way because they're not up,
 but why are they different?

Well, you below indicate that osd.14's log says it crashed on an
internal heartbeat timeout (usually, it got stuck waiting for disk IO
or the kernel/btrfs hung), so that's why. The osd.12 process exists
but isn't up; osd.14 doesn't even have a socket to connect to.
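
(A quick way to see which daemons actually have an admin socket, assuming the
default /var/run/ceph paths; "ceph daemon osd.N ..." is just a shortcut for
pointing at that socket:)

ls -l /var/run/ceph/                       # ceph-osd.14.asok should be absent
ps aux | grep [c]eph-osd                   # which ceph-osd processes exist
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok status   # same as 'ceph daemon osd.12 status'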


 In terms of logs, ceph-osd.12.log doesn't go beyond this:
 2015-02-22 10:38:29.629407 7fd24952c780  0 ceph version 0.80.8
 (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
 2015-02-22 10:38:29.639802 7fd24952c780  0
 filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
 

[ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-23 Thread Chris Murray
... Trying to send again after reporting bounce backs to dreamhost ...
... Trying to send one more time after seeing mails come through the
list today ...

Hi all,

First off, I should point out that this is a 'small cluster' issue and
may well be due to the stretched resources. If I'm doomed to destroying
and starting again, so be it, but I'm interested to see if things can
get up and running again.

My experimental ceph cluster now has 5 nodes with 3 osds each. Some
drives are big, some drives are small. Most are formatted with BTRFS and
two are still formatted with XFS, which I intend to remove and recreate
with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but
compression suits my use-case, so I'm prepared to stick with it while it
matures. I had to set the following, to avoid osds dying as the IO was
consumed by the snapshot creation and deletion process (as I understand
it):

filestore btrfs snap = false

and the mount options look like this:

osd mount options btrfs =
rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo
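
(For context, both of those are settings in the [osd] section of ceph.conf;
roughly:)

[osd]
    filestore btrfs snap = false
    osd mount options btrfs = rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo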

Each node is a HP Microserver n36l or n54l, with 8GB of memory, so CPU
horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is
also a mon.

My issue is: After adding the 15th osd, the cluster went into a spiral
of destruction, with osds going down one after another. One might go
down on occasion, and usually starting the osd in question will remedy
things. This time, though, it hasn't, and the problem appears to have
become worse and worse. I've tried starting osds, restarting whole
hosts, to no avail. I've brought all osds back 'in' and set noup, nodown
and noout. I've ceased rbd activity since it was getting blocked anyway.
The cluster appears to now be 'stuck' in this state:

cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
 health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969
pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445
pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck
unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full
osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
 monmap e5: 5 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=
192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618,
quorum 0,1,2,3,4 0,1,2,3,4
 osdmap e63276: 15 osds: 7 up, 15 in
flags noup,nodown,noout
  pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
4611 GB used, 871 GB / 5563 GB avail
838948/2578420 objects degraded (32.537%)
   3 down+remapped+peering
   8 stale+active+degraded+remapped
  85 active+clean
   1 stale+incomplete
1088 stale+down+peering
 642 active+degraded+remapped
   1 incomplete
  33 stale+remapped+peering
 135 down+peering
   1 stale+degraded
   1
stale+active+degraded+remapped+wait_backfill+backfill_toofull
 854 active+remapped
 234 stale+active+degraded
   4 active+degraded+remapped+backfill_toofull
  40 active+remapped+backfill_toofull
1079 active+degraded
   5 stale+active+clean
  74 stale+peering

Take one of the nodes. It holds osds 12 (down & in), 13 (up & in) and 14
(down & in).

# ceph osd stat
 osdmap e63276: 15 osds: 7 up, 15 in
flags noup,nodown,noout

# ceph daemon osd.12 status
no valid command found; 10 closest matches:
config show
help
log dump
get_command_descriptions
git_version
config set var val [val...]
version
2
config get var
0
admin_socket: invalid command

# ceph daemon osd.13 status
{ "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
  "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
  "whoami": 13,
  "state": "active",
  "oldest_map": 48214,
  "newest_map": 63276,
  "num_pgs": 790}

# ceph daemon osd.14 status
admin_socket: exception getting command descriptions: [Errno 111]
Connection refused

I'm assuming osds 12 and 14 are acting that way because they're not up,
but why are they different?

In terms of logs, ceph-osd.12.log doesn't go beyond this:
2015-02-22 10:38:29.629407 7fd24952c780  0 ceph version 0.80.8
(69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
2015-02-22 10:38:29.639802 7fd24952c780  0
filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
2015-02-22 10:38:29.876993 7fd24952c780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
FIEMAP ioctl is supported and appears to work
2015-02-22 10:38:29.877028 7fd24952c780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-22 10:38:31.487359 7fd24952c780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features: