Hi Sage,

Thanks for the response.  I noticed that as well, and suspected
hostname/DHCP/DNS shenanigans.  What's weird is that all nodes are
identically configured.  I also have monitors running on n0 and n12, and
they come up fine, every time.

Here's the mon_host line from ceph.conf:

mon_initial_members = n0, n12, n24
mon_host = 10.0.1.0,10.0.1.12,10.0.1.24

just to test /etc/hosts and name resolution...

root@n24:~# getent hosts n24
10.0.1.24       n24
root@n24:~# hostname -s
n24

The only loopback device in /etc/hosts is "127.0.0.1       localhost", so
that should be fine.
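For what it's worth, here is a minimal sketch of the kind of check I'm doing, with the mon_host line and node IP hardcoded from above (the variable names are just illustrative, not from any Ceph tooling):

```shell
# Sketch: verify the node's IP appears in ceph.conf's mon_host line.
# MON_HOST_LINE and NODE_IP are hardcoded example values from this thread;
# in practice they would come from ceph.conf and `getent hosts $(hostname -s)`.
MON_HOST_LINE="mon_host = 10.0.1.0,10.0.1.12,10.0.1.24"
NODE_IP="10.0.1.24"

case "$MON_HOST_LINE" in
  *"$NODE_IP"*) echo "OK: $NODE_IP listed in mon_host" ;;
  *)            echo "WARN: $NODE_IP missing from mon_host" ;;
esac
```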

Upon rebooting this node, I've had the monitor come up okay once, maybe out
of 12 tries.  So it appears to be some kind of race...  No clue what is
going on.  If I stop and start the monitor (or restart), it doesn't appear
to change anything.

However, on the topic of races, I'm having one other, more pressing issue.
Each OSD host has its hostname assigned via DHCP.  Until that assignment is
made (during init), the hostname is "localhost", and then it switches over
to "n&lt;x&gt;" for some node number.  The issue I'm seeing is that there is a
race between this hostname assignment and the Ceph Upstart scripts, such
that sometimes ceph-osd starts while the hostname is still "localhost".
This then causes the OSD's location to change in the crushmap, which is
going to be a very bad thing.  =)  When rebooting all my nodes at once
(there are several dozen), about 50% move from being under n&lt;x&gt; to
localhost.  Restarting all the ceph-osd jobs moves them back (because by
then the hostname is defined).

I'm wondering what kind of delay, or additional "start-on" logic I can add
to the upstart script to work around this.
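One idea I've been toying with is a pre-start stanza that simply waits for the hostname to change away from "localhost" before letting the job proceed. This is only a sketch of that workaround, not something I've tested; note that an .override file's pre-start stanza would replace the packaged one in the ceph-osd job, so it may be safer to patch the job file itself:

```shell
# /etc/init/ceph-osd.override  (hypothetical workaround sketch)
# Delay OSD start until DHCP has assigned the real hostname,
# giving up after ~30 seconds so a broken DHCP doesn't hang boot.
pre-start script
    i=0
    while [ "$(hostname -s)" = "localhost" ] && [ "$i" -lt 30 ]; do
        sleep 1
        i=$((i + 1))
    done
end script
```

Alternatively, if the DHCP client emits an Upstart event when the lease is bound, a "start on" condition keyed to that event would avoid the polling loop entirely.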


On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil <[email protected]> wrote:

> Hi Travis,
>
> On Fri, 23 Aug 2013, Travis Rhoden wrote:
> > Hey folks,
> >
> > I've just done a brand new install of 0.67.2 on a cluster of Calxeda
> > nodes.
> >
> > I have one particular monitor that never joins the quorum when I restart
> > the node.  Looks to me like it has something to do with the
> > "create-keys" task, which never seems to finish:
> >
> > root      1240     1  4 13:03 ?        00:00:02 /usr/bin/ceph-mon
> > --cluster=ceph -i n24 -f
> > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python
> > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> >
> > I don't see that task on my other monitors.  Additionally, that task is
> > periodically querying the monitor status:
> >
> > root      1240     1  2 13:03 ?        00:00:02 /usr/bin/ceph-mon
> > --cluster=ceph -i n24 -f
> > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python
> > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > root      1982  1244 15 13:04 ?        00:00:00 /usr/bin/python
> > /usr/bin/ceph --cluster=ceph
> > --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> >
> > Checking that status myself, I see:
> >
> > # ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
> > mon_status
> > { "name": "n24",
> >   "rank": 2,
> >   "state": "probing",
> >   "election_epoch": 0,
> >   "quorum": [],
> >   "outside_quorum": [
> >         "n24"],
> >   "extra_probe_peers": [],
> >   "sync_provider": [],
> >   "monmap": { "epoch": 2,
> >       "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> >       "modified": "2013-08-23 12:55:34.374650",
> >       "created": "0.000000",
> >       "mons": [
> >             { "rank": 0,
> >               "name": "n0",
> >               "addr": "10.0.1.0:6789\/0"},
> >             { "rank": 1,
> >               "name": "n12",
> >               "addr": "10.0.1.12:6789\/0"},
> >             { "rank": 2,
> >               "name": "n24",
> >               "addr": "0.0.0.0:6810\/0"}]}}
>                         ^^^^^^^^^^^^^^^^^^^^
>
> This is the problem.  I can't remember exactly what causes this, though.
> Can you verify that the host in the ceph.conf mon_host line matches the IP
> that is configured on the machine, and that /etc/hosts on the machine
> doesn't have a loopback address for it?
>
> Thanks!
> sage
>
> >
> > Any ideas what is going on here?  I don't see anything useful in
> > /var/log/ceph/ceph-mon.n24.log
> >
> >  Thanks,
> >
> >  - Travis
> >
> >
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
