I've reworked the monitor bootstrapping. It's still a little rough around
the edges in terms of feeding in initial cluster state, but all the
monitor refactoring is done so it should be mainly cleanup from here.
The basic bootstrap/mkfs process looks something like this:
$ ceph-authtool /etc/ceph/keyring --create-keyring --gen-key -n client.admin
$ ceph-authtool /etc/ceph/keyring --gen-key -n mon.
and then either
$ monmaptool /tmp/monmap --create --clobber --add host1 1.2.3.4 \
      --add host2 1.2.3.5 [...]
and on each host
$ ceph-mon -i `hostname` --mkfs --monmap /tmp/monmap
or define the monitors, their mon addrs, and an fsid (`uuidgen`) in
ceph.conf, and on each host
$ ceph-mon -i `hostname` --mkfs
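For the ceph.conf variant, here is a minimal sketch of a helper that prints
such a file. The helper name seed_conf and its NAME=ADDR argument form are
made up for illustration; only the 'fsid' and 'mon addr' options and the
[mon.NAME] section naming come from this mail.

```shell
# Hypothetical helper (not part of ceph): print a minimal seed ceph.conf.
# Usage: seed_conf FSID NAME=ADDR [NAME=ADDR ...]
seed_conf() {
    fsid=$1
    shift
    printf '[global]\n\tfsid = %s\n' "$fsid"
    for pair in "$@"; do
        name=${pair%%=*}   # text before the first '='
        addr=${pair#*=}    # text after the first '='
        printf '[mon.%s]\n\tmon addr = %s\n' "$name" "$addr"
    done
}
```

Something like seed_conf "$(uuidgen)" host1=1.2.3.4 host2=1.2.3.5 would then
produce a file that can be copied to each host before running --mkfs.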
One way or another, --mkfs is building an initial "seed" monmap that has an
fsid and a list of initial monitor addresses. If you explicitly pass in a
monmap (generated by monmaptool --create ...) that's pretty clear.
Alternatively, it will make an initial map based on the --mon-host a,b,c
list of addresses or on what it finds in ceph.conf. (This is the same
bootstrapping that takes place when a random daemon or tool starts up and
needs to contact a monitor to authenticate.) The fsid is required, but
can come from the generated monmap, command line (--fsid $uuid), or an
'fsid' option in ceph.conf.
There is likely some tweaking we can do here, particularly with the
manual address specification step (TV is working on this), but the basic
requirement is that we have (1) a unique fsid, (2) a list of initial
monitor addresses, and (3) a keyring with the mon. and client.admin secret
keys. Without those, the new monitors don't know who to talk to in order
to form the new cluster and initialize themselves.
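As a rough illustration of those three requirements, a pre-flight check
could look like the sketch below. The function name and messages are
hypothetical, and it assumes the keyring is the plain-text file written by
ceph-authtool, with each entity in its own [name] section header line.

```shell
# Hypothetical helper (not part of ceph): verify the three bootstrap inputs
# before running ceph-mon --mkfs.
check_bootstrap_inputs() {
    fsid=$1       # cluster fsid, e.g. from uuidgen
    mon_addrs=$2  # list of initial monitor addresses
    keyring=$3    # path to the keyring file

    # (1) the fsid must look like a UUID (8-4-4-4-12 hex digits)
    if ! printf '%s\n' "$fsid" |
        grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
        echo "missing or malformed fsid" >&2
        return 1
    fi
    # (2) there must be at least one initial monitor address
    if [ -z "$mon_addrs" ]; then
        echo "no initial monitor addresses" >&2
        return 1
    fi
    # (3) the keyring must have both a mon. and a client.admin section
    for entity in 'mon.' 'client.admin'; do
        if ! grep -Fqx "[$entity]" "$keyring" 2>/dev/null; then
            echo "keyring lacks [$entity]" >&2
            return 1
        fi
    done
    return 0
}
```

It only checks that the inputs are present and plausible; it does not
verify that the keys themselves are valid.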
Thereafter, you can add monitors to the cluster the exact same way. As
long as the fsid matches, the secret key is valid, and one of the monitors
in the seed monmap is alive and well, the new monitor will sync itself and
then add itself to the cluster (by adding itself to the cluster's master
monmap).
For example, after adding a new [mon.`hostname`] section to your ceph.conf
with 'mon addr' defined,
$ ceph auth get mon. -o /tmp/monkey
$ fsid=`ceph fsid --concise`
$ ceph-mon -i `hostname` --mkfs -k /tmp/monkey --fsid $fsid
$ ceph-mon -i `hostname`
will add a new monitor to the cluster. Here, the new monitor gets its
peers from ceph.conf and is given the mon. key and fsid explicitly. You
could also pass a recent copy of the monmap instead of relying on ceph.conf
(if, say, the local ceph.conf doesn't list all monitors).
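A sketch of that monmap variant, with some assumptions: 'ceph mon getmap -o'
is assumed to be the way to fetch the current monmap, and the fetched map is
assumed to carry the fsid just as a generated one does, so --fsid is left
out. The helper name and scratch directory are hypothetical. It only prints
the commands so they can be reviewed before being run.

```shell
# Hypothetical helper (not part of ceph): print the commands for adding a
# monitor from a fetched monmap instead of from ceph.conf.
# Usage: mon_add_commands NAME [SCRATCH_DIR]
mon_add_commands() {
    name=$1
    dir=${2:-/tmp}
    if [ -z "$name" ]; then
        echo "usage: mon_add_commands NAME [SCRATCH_DIR]" >&2
        return 1
    fi
    cat <<EOF
ceph auth get mon. -o $dir/monkey
ceph mon getmap -o $dir/monmap
ceph-mon -i $name --mkfs --monmap $dir/monmap -k $dir/monkey
ceph-mon -i $name
EOF
}
```

Once the output looks right, it can be piped to 'sh -e' on the new monitor
host.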
The vstart.sh script has been switched to use the new process. Mainly
this means that the initial osdmap isn't generated beforehand. Instead,
when each osd is added, we do something like
$ n=`ceph osd create --concise`
$ ceph osd crush add $n osd.$n 1.0 host=localhost rack=localrack pool=default
$ ceph-osd -i $n --mkfs --mkkey
$ ceph auth add osd.$n osd "allow *" mon "allow rwx" -i dev/osd$n/keyring
$ ceph-osd -i $n
which allocates an osd id, adds it to the crush map, initializes the osd
data dir and creates a random secret, adds that secret to the monitor auth
database, and then starts the osd.
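To repeat those steps for several osds, one option is to generate the
per-osd commands from the allocated id, roughly as below. The function name
is hypothetical, the commands are the ones listed above, and the
dev/osd$n/keyring path assumes the vstart.sh working directory.

```shell
# Hypothetical helper (not part of vstart.sh): print the commands that set
# up and start one osd, given an id already allocated by 'ceph osd create'.
osd_add_commands() {
    n=$1
    case $n in
        ''|*[!0-9]*)
            echo "usage: osd_add_commands OSD_ID" >&2
            return 1
            ;;
    esac
    cat <<EOF
ceph osd crush add $n osd.$n 1.0 host=localhost rack=localrack pool=default
ceph-osd -i $n --mkfs --mkkey
ceph auth add osd.$n osd "allow *" mon "allow rwx" -i dev/osd$n/keyring
ceph-osd -i $n
EOF
}

# Possible use against a running cluster (left commented out here):
#   for i in 1 2 3; do
#       n=`ceph osd create --concise`
#       osd_add_commands "$n" | sh -e
#   done
```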
One other piece here: currently, when a tool or daemon starts up, we build
our initial monmap (list of monitors to try to contact) in this order of
preference:
1- Was --monmap <fn> specified? (Normally it's not.)
2- Was --mon-host <list> specified? If so, resolve DNS names and use
   that. Fill in fsid if provided (in ceph.conf or command line;
   normally it's not).
3- Look at the 'mon addr' values in the mon.* sections in my ceph.conf to
   build a list. Fill in fsid if provided.
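The order of preference above can be sketched as a small shell function.
This is only an illustration, not the actual ceph code: the function name
and the "source:value" output format are made up, and the mon host list is
taken as an argument regardless of whether it came from the command line or
ceph.conf.

```shell
# Hypothetical sketch of the lookup order; not the actual ceph code.
# $1 = --monmap file (may be empty)
# $2 = mon host list (may be empty)
# $3 = path to ceph.conf
pick_monmap_source() {
    monmap_file=$1
    mon_host=$2
    conf=$3
    if [ -n "$monmap_file" ]; then
        # 1- an explicit monmap wins
        echo "monmap:$monmap_file"
    elif [ -n "$mon_host" ]; then
        # 2- otherwise use the mon host list
        echo "mon-host:$mon_host"
    elif [ -r "$conf" ] &&
        grep -Eq '^[[:space:]]*mon addr[[:space:]]*=' "$conf"; then
        # 3- otherwise fall back to 'mon addr' values in ceph.conf
        echo "conf:$conf"
    else
        echo "no monitor addresses found" >&2
        return 1
    fi
}
```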
The current normal practice is #3, with a ceph.conf on every node that
has [mon.NNN] sections and mon addr values. Instead, you can do #2, which
means you have something like
[global]
mon host = one.foo.com two.foo.com three.foo.com
One nice thing is that the client will try these at random until it
connects and authenticates. Once that happens, it gets the real current
monmap, which may include hosts not listed here. That means things like
adding new monitors don't strictly require that you update ceph.conf all
over the place (although that's presumably a good thing to do at some
point).
That's where we are currently in the master branch. For those of you
working on the Chef and Juju stuff, if you have feedback on whether there
are still pain points, now's the time to share! :)
sage