Thanks, we'll give the gitbuilder packages a shot and report back.

Robert LeBlanc

Sent from a mobile device; please excuse any typos.
On Mar 27, 2015 10:03 PM, "Sage Weil" <s...@newdream.net> wrote:

> On Fri, 27 Mar 2015, Robert LeBlanc wrote:
> > I've built Ceph clusters a few times now and I'm completely baffled
> > about what we are seeing. We had a majority of the nodes on a new
> > cluster go down yesterday and we got PGs stuck peering. We checked
> > logs, firewalls, file descriptors, etc., and nothing is pointing to what
> > the problem is. We thought we could work around the problem by
> > deleting all the pools and recreating them, but still most of the PGs
> > were in a creating+peering state. Rebooting OSDs, reformatting them,
> > adjusting the CRUSH map, etc. all proved fruitless. I took min_size and
> > size down to 1 and tried scrubbing and deep-scrubbing the PGs and OSDs. Nothing
> > seems to get the cluster to progress.
> >
> > As a last ditch effort, we wiped the whole cluster, regenerated UUID,
> > keys, etc and pushed it all through puppet again. After creating the
> > OSDs there are PGs stuck. Here is some info:
> >
> > [ulhglive-root@mon1 ~]# ceph status
> >     cluster fa158fa8-3e5d-47b1-a7bc-98a41f510ac0
> >      health HEALTH_WARN
> >             1214 pgs peering
> >             1216 pgs stuck inactive
> >             1216 pgs stuck unclean
> >      monmap e2: 3 mons at {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> >             election epoch 6, quorum 0,1,2 mon1,mon2,mon3
> >      osdmap e161: 130 osds: 130 up, 130 in
> >       pgmap v468: 2048 pgs, 2 pools, 0 bytes data, 0 objects
> >             5514 MB used, 472 TB / 472 TB avail
> >                  965 peering
> >                  832 active+clean
> >                  249 creating+peering
> >                    2 activating
>
> Usually when we've seen something like this it has been something annoying
> with the environment, like a broken network that causes the TCP streams to
> freeze once they start sending significant traffic (e.g., affecting the
> connections that transport data but not the ones that handle heartbeats).
>
> As you're rebuilding, perhaps the issues start once you hit a particular
> rack or host?
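>
> One quick sanity check for that kind of breakage is a full-size,
> don't-fragment ping between OSD hosts in both directions (the sizes
> below assume a 9000-byte MTU somewhere in the path, and the hostname
> is just an example):
>
>   ping -M do -s 8972 -c 3 osd-host-02    # jumbo-sized, DF set
>   ping -M do -s 1472 -c 3 osd-host-02    # standard 1500-byte path
>
> If the small ping gets through but the large one doesn't, heartbeats
> will look healthy while the data connections stall, which matches the
> symptom above.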
>
> > [ulhglive-root@mon1 ~]# ceph health detail | head -n 15
> > HEALTH_WARN 1214 pgs peering; 1216 pgs stuck inactive; 1216 pgs stuck unclean
> > pg 2.17f is stuck inactive since forever, current state creating+peering, last acting [39,42,77]
> > pg 2.17e is stuck inactive since forever, current state creating+peering, last acting [125,3,110]
> > pg 2.179 is stuck inactive since forever, current state peering, last acting [0]
> > pg 2.178 is stuck inactive since forever, current state creating+peering, last acting [99,120,54]
> > pg 2.17b is stuck inactive since forever, current state peering, last acting [0]
> > pg 2.17a is stuck inactive since forever, current state creating+peering, last acting [91,96,122]
> > pg 2.175 is stuck inactive since forever, current state creating+peering, last acting [55,127,2]
> > pg 2.174 is stuck inactive since forever, current state peering, last acting [0]
> > pg 2.176 is stuck inactive since forever, current state creating+peering, last acting [13,70,8]
> > pg 2.172 is stuck inactive since forever, current state peering, last acting [0]
> > pg 2.16c is stuck inactive for 1344.369455, current state peering, last acting [99,104,85]
> > pg 2.16e is stuck inactive since forever, current state peering, last acting [0]
> > pg 2.169 is stuck inactive since forever, current state creating+peering, last acting [125,24,65]
> > pg 2.16a is stuck inactive since forever, current state peering, last acting [0]
> > Traceback (most recent call last):
> >   File "/bin/ceph", line 896, in <module>
> >     retval = main()
> >   File "/bin/ceph", line 883, in main
> >     sys.stdout.write(prefix + outbuf + suffix)
> > IOError: [Errno 32] Broken pipe
> > [ulhglive-root@mon1 ~]# ceph pg dump_stuck | head -n 15
> > ok
> > pg_stat state   up      up_primary      acting  acting_primary
> > 2.17f   creating+peering        [39,42,77]      39      [39,42,77]      39
> > 2.17e   creating+peering        [125,3,110]     125     [125,3,110]     125
> > 2.179   peering [0]     0       [0]     0
> > 2.178   creating+peering        [99,120,54]     99      [99,120,54]     99
> > 2.17b   peering [0]     0       [0]     0
> > 2.17a   creating+peering        [91,96,122]     91      [91,96,122]     91
> > 2.175   creating+peering        [55,127,2]      55      [55,127,2]      55
> > 2.174   peering [0]     0       [0]     0
> > 2.176   creating+peering        [13,70,8]       13      [13,70,8]       13
> > 2.172   peering [0]     0       [0]     0
> > 2.16c   peering [99,104,85]     99      [99,104,85]     99
> > 2.16e   peering [0]     0       [0]     0
> > 2.169   creating+peering        [125,24,65]     125     [125,24,65]     125
> > 2.16a   peering [0]     0       [0]     0
> >
> > Focusing on 2.17f on OSD 39, I set debugging to 20/20 and am attaching
> > the logs. I've looked through the logs with 20/20 before we toasted
> > the cluster and I couldn't find anything standing out. I have another
> > cluster that is also exhibiting this problem which I'd prefer not to
> > lose the data on. If anything stands out, please let me know. We are
> > going to wipe this cluster again and take more manual steps.
> >
> > ceph-osd.39.log.xz - https://owncloud.leblancnet.us/owncloud/public.php?service=files&t=b120a67cc6111ffcba54d2e4cc8a62b5
> > map.xz - https://owncloud.leblancnet.us/owncloud/public.php?service=files&t=df1eecf7d307225b7d43b5c9474561d0
>
> It looks like this particular PG isn't getting a query response from
> osd.39 and osd.42.  'ceph pg 2.17f query' will likely tell you something
> similar, namely that it is still trying to get info from those OSDs.  If
> you crank up debug ms = 20 you'll be able to watch it try to connect and
> send messages to those peers as well, and if you have logging on the
> other end you can see whether the message arrives or not.
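>
> Something along these lines should show it (osd and pg ids taken from
> your output above; bump the debug levels back down afterwards):
>
>   ceph pg 2.17f query
>   ceph tell osd.39 injectargs '--debug-ms 20 --debug-osd 20'
>   ceph tell osd.42 injectargs '--debug-ms 20 --debug-osd 20'
>   # then tail /var/log/ceph/ceph-osd.39.log and ceph-osd.42.log and
>   # look for the pg_query going out and whether a reply ever comes back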
>
> It's also possible that this is a bug in 0.93 that we've fixed (there have
> been tons of those); before investing too much effort I would try
> installing the latest hammer branch from the gitbuilders as that's
> very very close to what will be released next week.
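>
> On RPM-based boxes that's roughly a repo file like the one below -- the
> exact gitbuilder directory depends on your distro/arch, so check the
> listing on gitbuilder.ceph.com first (this is a sketch, not the
> authoritative URL):
>
>   # /etc/yum.repos.d/ceph-gitbuilder.repo   (illustrative)
>   [ceph-gitbuilder-hammer]
>   name=Ceph gitbuilder hammer branch
>   baseurl=http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer/
>   enabled=1
>   gpgcheck=0
>
>   yum clean metadata && yum update ceph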
>
> Hope that helps!
> sage
>
>
> >
> >
> > After redoing the cluster again, we started slow. We added one OSD,
> > dropped the pools to min_size=1 and size=1, and the cluster became
> > healthy. We added a second OSD and changed the CRUSH rule to OSD and
> > it became healthy again. We changed size=3 and min_size=2. We had
> > puppet add 10 OSDs on one host and waited; the cluster became healthy
> > again. We had puppet add another host with 10 OSDs and waited for the
> > cluster to become healthy again. We had puppet add the 8 remaining
> > OSDs on the first host and the cluster became healthy again. We set
> > the CRUSH rule back to host and the cluster became healthy again.
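> >
> > (For reference, the knobs involved were roughly the following; the
> > pool name is illustrative:
> >   ceph osd pool set <pool> size 1
> >   ceph osd pool set <pool> min_size 1
> > plus flipping the rule's "step chooseleaf firstn 0 type host" to
> > "type osd" and back with crushtool / ceph osd setcrushmap.)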
> >
> > In order to test a theory we decided to kick off puppet on the
> > remaining 10 hosts with 10 OSDs each at the same time (similar to what
> > we did before). When about the 97th OSD was added, we started getting
> > messages in ceph -w about stuck PGs and the cluster never became
> > healthy.
> >
> > I wonder if too many changes in too short a time are causing the OSDs
> > to overrun a journal or something (I know that Ceph journals pgmap
> > changes and such). I'm concerned that this could be very detrimental
> > in a production environment, and there doesn't seem to be a way to
> > recover from it.
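> >
> > (If map churn is the culprit, would throttling a mass add with the
> > standard cluster flags be a reasonable workaround? Something like
> > "ceph osd set noin" and "ceph osd set nobackfill" before kicking off
> > puppet, then unsetting them once all the OSDs are registered.)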
> >
> > Any thoughts?
> >
> > Thanks,
> > Robert LeBlanc