MDS: I assumed I'd need to bring up a ceph-mds for my cluster at initial
bringup. We also intended to modify the CRUSH map so that its pool is
resident on SSD(s). It's one of the areas where the online docs don't seem to
have a lot of info, and I haven't spent much time researching it. I'll
stop the MDS.
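For reference, a minimal sketch of stopping it, assuming the Firefly-era sysvinit
init script on this CentOS box and the MDS name shown in the mdsmap below:

  service ceph stop mds.essperf3       # stop the local MDS daemon
  # or, equivalently:
  /etc/init.d/ceph stop mds.essperf3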
OSD connectivity: The connectivity is good for both 1GE and 10GE. I thought
moving to 10GE with nothing else on that net might help with placement group
peering etc. and bring the PGs up quicker. I've checked 'tcpdump' output on all
boxes.
Firewall: Thanks for that one - it's the "basic" I overlooked on my ceph
learning curve. One of the OSDs had selinux=enforcing - all of the others were
disabled. After changing that box, the 10 PGs in my demo-pool (I kept the PG
count very small for sanity) are now 'active+clean'. The PGs for the default
pools - data, metadata, rbd - are still stuck in creating+peering or
creating+incomplete. I did have to manually set 'osd pool default min size
= 1' from its default of 2 for these 3 pools to eliminate a bunch of warnings
in the 'ceph health detail' output.
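A sketch of the per-pool equivalent, in case the conf-file default only applies
to newly created pools and not to the three that already exist:

  ceph osd pool set data min_size 1
  ceph osd pool set metadata min_size 1
  ceph osd pool set rbd min_size 1
  ceph health detail        # re-check the warnings afterwards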
I'm adding the [mon] setting you suggested below, stopping ceph-mds, and
bringing everything up now.
[root@essperf3 Ceph]# ceph -s
cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
health HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck
inactive; 192 pgs stuck unclean; 28 requests are blocked > 32 sec;
nodown,noscrub flag(s) set
monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1,
quorum 0 essperf3
mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
osdmap e752: 3 osds: 3 up, 3 in
flags nodown,noscrub
pgmap v1483: 202 pgs, 4 pools, 0 bytes data, 0 objects
134 MB used, 1158 GB / 1158 GB avail
96 creating+peering
10 active+clean <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<!!!!!!!!
96 creating+incomplete
[root@essperf3 Ceph]#
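To dig into why the remaining PGs stay in creating, querying one of them
directly may help; the pg id below is hypothetical:

  ceph pg dump_stuck inactive    # list the stuck PGs and the OSDs they map to
  ceph pg 0.1f query             # detailed peering state for a single stuck PG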
From: Brian Rak [mailto:[email protected]]
Sent: Friday, August 01, 2014 2:54 PM
To: Bruce McFarland; [email protected]
Subject: Re: [ceph-users] Firefly OSDs stuck in creating state forever
Why do you have a MDS active? I'd suggest getting rid of that at least until
you have everything else working.
I see you've set nodown on the OSDs, did you have problems with the OSDs
flapping? Do the OSDs have broken connectivity between themselves? Do you
have some kind of firewall interfering here?
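A quick sketch of checking for that, assuming the default ports (6789 for the
monitor, 6800-7300 for the OSDs) and a hypothetical peer host name:

  iptables -L -n           # look for rules dropping the ceph ports
  getenforce               # SELinux mode on each OSD box
  nc -zv osd-host 6800     # can this box reach the other OSDs' daemon ports?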
I've seen odd issues when the OSDs have broken private networking, you'll get
one OSD marking all the other ones down. Adding this to my config helped:
[mon]
mon osd min down reporters = 2
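If a restart is to be avoided, something along these lines should inject it at
runtime (a sketch; injectargs option handling can vary between releases):

  ceph tell mon.essperf3 injectargs '--mon-osd-min-down-reporters 2'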
On 8/1/2014 5:41 PM, Bruce McFarland wrote:
Hello,
I've run out of ideas and assume I've overlooked something very basic. I've
created 2 ceph clusters in the last 2 weeks with different OSD HW and private
network fabrics - 1GE and 10GE. I have never been able to get the placement
groups to reach the 'active+clean' state. I have followed your online
documentation, and at this point the only thing I don't think I've done is
modify the CRUSH map (although I have been looking into that). These are new
clusters with no data and only 1 HDD and 1 SSD per OSD (24 2.5GHz cores with
64GB RAM).
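For reference, a sketch of the quick sanity checks on the default CRUSH layout:

  ceph osd tree               # confirm all OSDs are up/in and under the expected hosts
  ceph osd crush dump         # inspect the default CRUSH buckets and rules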
Since the disks are being recycled, is there something I need to flag to let
ceph just create its mappings, but not scrub the old data for consistency? I've
tried setting the noscrub flag to no effect.
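For reference, a sketch of toggling the flags and of wiping a recycled disk
before re-use; the device name is hypothetical:

  ceph osd set noscrub         # disable scrubbing cluster-wide
  ceph osd set nodeep-scrub
  ceph osd unset noscrub       # revert once the PGs are active+clean
  ceph-disk zap /dev/sdb       # wipe a recycled disk before re-adding it as an OSD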
I also have constant OSD flapping. I've set nodown, but I assume that is just
masking a problem that is still occurring.
Besides never reaching the 'active+clean' state, ceph-mon always crashes
after being left running overnight. The OSDs all eventually fill /root with
ceph logs, so I regularly have to bring everything down, delete the logs, and
restart.
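As an aside, a sketch of pointing the logs at a dedicated log directory instead
of /root (paths assumed):

  [global]
  log file = /var/log/ceph/$cluster-$name.log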
I have all sorts of output: the ceph.conf; OSD boot output with 'debug osd
= 20' and 'debug ms = 1'; ceph -w output; and pretty much all of the
debug/monitoring suggestions from the online docs and 2 weeks of google
searches through blogs, mailing lists, etc.
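The debug settings above were applied along these lines (a sketch); they can
also be injected at runtime without restarting the OSDs:

  [osd]
  debug osd = 20
  debug ms = 1

  ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'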
[root@essperf3 Ceph]# ceph -v
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
[root@essperf3 Ceph]# ceph -s
cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
health HEALTH_WARN 96 pgs incomplete; 106 pgs peering; 202 pgs stuck
inactive; 202 pgs stuck unclean; nodown,noscrub flag(s) set
monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1,
quorum 0 essperf3
mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
osdmap e752: 3 osds: 3 up, 3 in
flags nodown,noscrub
pgmap v1476: 202 pgs, 4 pools, 0 bytes data, 0 objects
134 MB used, 1158 GB / 1158 GB avail
106 creating+peering
96 creating+incomplete
[root@essperf3 Ceph]#
Suggestions?
Thanks,
Bruce
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com