Christian,
I'll start out with 4 nodes. I understand rebalancing takes time.
[Eventually I'll need to swap out one of the nodes for a host I'm
currently using in production, but that can happen on a Saturday
afternoon.]


However, I do not fully understand this:


*"No, the default is to split at host level. So once you have enough nodes
in one room to fulfill the replication level (3) some PGs will be all in
that location "*

Beyond the map change, can you please also send the non-default Firefly
ceph.conf settings you would suggest for a 4 node anti-cephalopod
cluster?

I want to start my testing with close-to-ideal Ceph settings, then do a
lot of testing of noout and other things.
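For reference, here is my current guess at a starting point, so you can
tell me what to change (every value below is my own assumption, not
something you recommended):

    [global]
    osd pool default size = 3
    osd pool default min size = 1
    # wait 10 minutes before marking a down OSD out, so a short node
    # maintenance does not trigger a rebalance (the value is my guess)
    mon osd down out interval = 600
    # placeholder PG counts for a small 4 node cluster; to be sized
    # properly once I know the final OSD count
    osd pool default pg num = 512
    osd pool default pgp num = 512

And for the noout testing I plan something along these lines:

    ceph osd set noout      # stop OSDs from being marked out during the test
    # ...power a node off, watch 'ceph -w' and 'ceph health detail'...
    ceph osd unset noout    # back to normal behaviour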
After I'm done I'll document what was done and post it in a few places.

I appreciate the suggestions you've sent.

Kind regards,
Rob Fantini

On Tue, Jul 29, 2014 at 9:49 PM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Tue, 29 Jul 2014 06:33:14 -0400 Robert Fantini wrote:
>
> > Christian -
> >  Thank you for the answer. I'll get around to reading 'Crush Maps' a
> > few times; it is important to have a good understanding of Ceph's parts.
> >
> >  So another question -
> >
> >  As long as I keep the same number of nodes in both rooms, will Firefly
> > defaults keep the data balanced?
> >
> No, the default is to split at host level.
> So once you have enough nodes in one room to fulfill the replication level
> (3), some PGs will be all in that location.
>
> >
> > If not, I'll stick with 2 in each room until I understand how to
> > configure things.
> >
> That will work, but I would strongly advise you to get it right from the
> start, as in, configure the CRUSH map to your needs (split on room or such).
>
> Because if you introduce this change later, your data will be
> rebalanced...
>
> Christian
>
> >
> > On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > >
> > > On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:
> > >
> > > > "target replication level of 3"
> > > > " with a min of 1 across the node level"
> > > >
> > > > After reading
> > > > http://ceph.com/docs/master/rados/configuration/ceph-conf/ , I
> > > > assume that to accomplish that I should set these in ceph.conf?
> > > >
> > > > osd pool default size = 3
> > > > osd pool default min size = 1
> > > >
> > > Not really; min size specifies how few replicas need to be online
> > > for Ceph to accept IO.
> > >
> > > These settings (the current Firefly defaults) with the default CRUSH
> > > map will have 3 copies of the data spread over 3 OSDs, never using the
> > > same node (host) more than once.
> > > So with 2 nodes in each location, a replica will always be in both
> > > locations. However, if you add more nodes, all of them could wind up in
> > > the same building.
> > >
> > > To prevent this, there are location qualifiers beyond host, and you can
> > > modify the CRUSH map to enforce that at least one replica is in a
> > > different rack, row, room, region, etc.
> > >
> > > Advanced material, but one really needs to understand this:
> > > http://ceph.com/docs/master/rados/operations/crush-map/
> > >
> > > Christian
> > >
> > >
> > > >
> > > > On Mon, Jul 28, 2014 at 2:56 PM, Michael <mich...@onlinefusion.co.uk>
> > > > wrote:
> > > >
> > > > >  If you've two rooms then I'd go for two OSD nodes in each room, a
> > > > > target replication level of 3 with a min of 1 across the node
> > > > > level, then have 5 monitors and put the last monitor outside of
> > > > > either room (the other MONs can share with the OSD nodes if
> > > > > needed). Then you've got 'safe' replication for OSD/node
> > > > > replacement on failure, with some 'shuffle' room for when it's
> > > > > needed, and either room can be down while the external fifth
> > > > > monitor provides the quorum needed for a single room to operate.
> > > > >
> > > > > There's no way you can do a 3/2 MON split that doesn't risk the two
> > > > > nodes being up and unable to serve data while the three are down, so
> > > > > you'd need to find a way to make it a 2/2/1 split instead.
> > > > >
> > > > > -Michael
> > > > >
> > > > >
> > > > > On 28/07/2014 18:41, Robert Fantini wrote:
> > > > >
> > > > >  OK, for higher availability 5 nodes is better than 3, so
> > > > > we'll run 5. However we want normal operations with just 2
> > > > > nodes. Is that possible?
> > > > >
> > > > >  Eventually 2 nodes will be in the next building, 10 feet away,
> > > > > with a brick wall in between, connected with InfiniBand or better.
> > > > > So one room can go offline and the other will stay on. The flip of
> > > > > the coin means the 3-node room will probably be the one to go down.
> > > > >  All systems will have dual power supplies connected to different
> > > > > UPSes. In addition we have a power generator; later we'll add a
> > > > > 2nd generator, and then the UPSes will use different lines
> > > > > attached to those generators somehow.
> > > > > Also, of course, we never count on one cluster to have our data.
> > > > > We have 2 co-locations, with backups going out often using zfs
> > > > > send/receive and/or rsync.
> > > > >
> > > > >  So for the 5 node cluster, how do we set it so that 2 nodes up =
> > > > > OK? Or is that a bad idea?
> > > > >
> > > > >
> > > > >  PS: any other ideas on how to increase availability are welcome.
> > > > >
> > > > > On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <ch...@gol.com>
> > > > > wrote:
> > > > >
> > > > >>  On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> > > > >>
> > > > >> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > > > >> > >
> > > > >> > > Hello,
> > > > >> > >
> > > > >> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > > > >> > >
> > > > >> > >> Hello Christian,
> > > > >> > >>
> > > > >> > >> Let me supply more info and answer some questions.
> > > > >> > >>
> > > > >> > >> * Our main concern is high availability, not speed.
> > > > >> > >> Our storage requirements are not huge.
> > > > >> > >> However we want good keyboard response 99.99% of the time.
> > > > >> > >> We mostly do data entry and reporting. 20-25 users doing
> > > > >> > >> mostly order and invoice processing, and email.
> > > > >> > >>
> > > > >> > >> * DRBD has been very reliable, but I am the SPOF. Meaning
> > > > >> > >> that when split brain occurs [every 18-24 months], it is me
> > > > >> > >> or no one who knows what to do. Try explaining how to deal
> > > > >> > >> with split brain in advance.... For the future, Ceph looks
> > > > >> > >> like it will be easier to maintain.
> > > > >> > >>
> > > > >> > > The DRBD people would of course tell you to configure things
> > > > >> > > in a way that a split brain can't happen. ^o^
> > > > >> > >
> > > > >> > > Note that given the right circumstances (too many OSDs down,
> > > > >> > > MONs down) Ceph can wind up in a similar state.
> > > > >> >
> > > > >> >
> > > > >> > I am not sure what you mean by Ceph winding up in a similar
> > > > >> > state. If you mean 'split brain' in the usual sense of the
> > > > >> > term, it does not occur in Ceph.  If it does, you have
> > > > >> > surely found a bug and you should let us know with lots of CAPS.
> > > > >> >
> > > > >> > What you can incur though if you have too many monitors down is
> > > > >> > cluster downtime.  The monitors will ensure you need a strict
> > > > >> > majority of monitors up in order to operate the cluster, and
> > > > >> > will not serve requests if said majority is not in place.  The
> > > > >> > monitors will only serve requests when there's a formed
> > > > >> > 'quorum', and a quorum is only formed by (N/2)+1 monitors, N
> > > > >> > being the total number of monitors in the cluster (via the
> > > > >> > monitor map -- monmap).
> > > > >> >
> > > > >> > This said, if out of 3 monitors you have 2 monitors down, your
> > > > >> > cluster will cease functioning (no admin commands, no writes or
> > > > >> > reads served). As there is no configuration in which you can
> > > > >> > have two strict majorities, no two partitions of the
> > > > >> > cluster are able to function at the same time, so you do not
> > > > >> > incur split brain.
> > > > >> >
> > > > >>  I wrote similar state, not "same state".
> > > > >>
> > > > >> From a user perspective it is purely semantics how and why your
> > > > >> shared storage has seized up; the end result is the same.
> > > > >>
> > > > >> And yes, that MON example was exactly what I was aiming for: your
> > > > >> cluster might still have all the data (another potential failure
> > > > >> mode, of course), but it is inaccessible.
> > > > >>
> > > > >> DRBD will see and call it a split brain, Ceph will call it a Paxos
> > > > >> voting failure; it doesn't matter one iota to the poor sod
> > > > >> relying on that particular storage.
> > > > >>
> > > > >> My point was and is, when you design a cluster of whatever flavor,
> > > > >> make sure you understand how it can (and WILL) fail, how to
> > > > >> prevent that from happening if at all possible and how to recover
> > > > >> from it if not.
> > > > >>
> > > > >> Potentially (hopefully) in the case of Ceph it would just be a
> > > > >> matter of getting a missing MON back up.
> > > > >> But given that the failed MON might have a corrupted leveldb (it
> > > > >> happened to me), Robert could be back at square one, as in, a
> > > > >> highly qualified engineer has to deal with the issue.
> > > > >> I.e., somebody who can say "screw this dead MON, let's get a new
> > > > >> one in" and is capable of doing so.
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Christian
> > > > >>
> > > > >> > If you are a creative admin however, you may be able to enforce
> > > > >> > split brain by modifying monmaps.  In the end you'd obviously
> > > > >> > end up with two distinct monitor clusters, but if you so
> > > > >> > happened to not inform the clients about this there's a fair
> > > > >> > chance that it would cause havoc with unforeseen effects.  Then
> > > > >> > again, this would be the operator's fault, not Ceph's --
> > > > >> > especially because rewriting monitor maps is not trivial enough
> > > > >> > for someone to mistakenly do something like this.
> > > > >> >
> > > > >> >    -Joao
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >>  Christian Balzer        Network/Systems Engineer
> > > > >> ch...@gol.com           Global OnLine Japan/Fusion Communications
> > > > >> http://www.gol.com/
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > > http://www.gol.com/
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>