Re: [ceph-users] anti-cephalopod question

2014-07-30 Thread Robert Fantini
Christian.
I'll start out with 4 nodes. I understand re-balancing takes time.
[Eventually I'll need to swap out one of the nodes with a host I'm using
for production, but that will be on a Saturday afternoon.]


However I do not fully get this:


"No, the default is to split at host level. So once you have enough nodes
in one room to fulfill the replication level (3) some PGs will be all in
that location."

Can you please send the non-default Firefly ceph.conf settings for a
4-node anti-cephalopod cluster?

I want to start my testing with close-to-ideal Ceph settings, then do a
lot of testing of noout and other things.
After I'm done I'll document what was done and post it in a few places.
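
As a starting point for that testing, here is a minimal sketch of the
ceph.conf settings pulled from this thread (the pool values are simply the
Firefly defaults written out; the downout subtree limit is the non-default
setting Christian suggests further down; the room split itself is a CRUSH
map change rather than a ceph.conf setting, see the sketch below):

    [global]
    osd pool default size = 3           # 3 replicas (Firefly default)
    osd pool default min size = 1       # accept I/O with 1 replica left
    mon osd down out interval = 300     # seconds before a down OSD is outed (default)
    mon osd downout subtree limit = host  # never auto-out an entire failed host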

I appreciate the suggestions you've sent.

kind regards, rob fantini









On Tue, Jul 29, 2014 at 9:49 PM, Christian Balzer ch...@gol.com wrote:


 Hello,

 On Tue, 29 Jul 2014 06:33:14 -0400 Robert Fantini wrote:

  Christian -
   Thank you for the answer,   I'll get around to reading 'Crush Maps '  a
  few times  ,  it is important to have a good understanding of ceph parts.
 
   So another question -
 
   As long as I keep the same number of nodes in both rooms, will  firefly
  defaults keep data balanced?
 
 No, the default is to split at host level.
 So once you have enough nodes in one room to fulfill the replication level
 (3) some PGs will be all in that location.

 
  If not I'll stick with 2 each room until I understand how configure
  things.
 
 That will work, but I would strongly advise you to get it right from the
 start, as in configure the Crush map to your needs split on room or such.

 Because if you introduce this change later, your data will be
 rebalanced...
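
 For illustration, a rough sketch of what such a room split could look
 like in a decompiled CRUSH map (the bucket names, ids and weights are
 made up; with two rooms and size 3 this rule puts two copies in one room
 and one in the other):

     # edit cycle:
     #   ceph osd getcrushmap -o crush.bin
     #   crushtool -d crush.bin -o crush.txt
     #   ... edit crush.txt ...
     #   crushtool -c crush.txt -o crush.new
     #   ceph osd setcrushmap -i crush.new

     room room1 {
             id -10                  # illustrative id
             alg straw
             hash 0                  # rjenkins1
             item node1 weight 1.000
             item node2 weight 1.000
     }
     room room2 {
             id -11
             alg straw
             hash 0
             item node3 weight 1.000
             item node4 weight 1.000
     }
     root default {
             id -1
             alg straw
             hash 0
             item room1 weight 2.000
             item room2 weight 2.000
     }

     rule replicated_rooms {
             ruleset 1
             type replicated
             min_size 2
             max_size 4
             step take default
             step choose firstn 2 type room       # pick both rooms
             step chooseleaf firstn 2 type host   # up to 2 hosts per room
             step emit
     }

     # then point the pool at the new rule:
     #   ceph osd pool set <pool> crush_ruleset 1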

 Christian

 
  On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer ch...@gol.com wrote:
 
  
   On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:
  
target replication level of 3
 with a min of 1 across the node level
   
After reading
http://ceph.com/docs/master/rados/configuration/ceph-conf/ ,   I
assume that to accomplish that then set these in ceph.conf   ?
   
osd pool default size = 3
osd pool default min size = 1
   
   Not really, the min size specifies how few replicas need to be online
   for Ceph to accept IO.
  
   These (the current Firefly defaults) settings with the default crush
   map will have 3 sets of data spread over 3 OSDs and not use the same
   node (host) more than once.
   So with 2 nodes in each location, a replica will always be both
   locations. However if you add more nodes, all of them could wind up in
   the same building.
  
   To prevent this, you have location qualifiers beyond host and you can
   modify the crush map to enforce that at least one replica is in a
   different rack, row, room, region, etc.
  
   Advanced material, but one really needs to understand this:
   http://ceph.com/docs/master/rados/operations/crush-map/
  
   Christian
  
  
   
   
   
   
   
   
On Mon, Jul 28, 2014 at 2:56 PM, Michael mich...@onlinefusion.co.uk
 
wrote:
   
  If you've two rooms then I'd go for two OSD nodes in each room, a
 target replication level of 3 with a min of 1 across the node
 level, then have 5 monitors and put the last monitor outside of
 either room (The other MON's can share with the OSD nodes if
 needed). Then you've got 'safe' replication for OSD/node
 replacement on failure with some 'shuffle' room for when it's
 needed and either room can be down while the external last monitor
 allows the decisions required to allow a single room to operate.

 There's no way you can do a 3/2 MON split that doesn't risk the two
 nodes being up and unable to serve data while the three are down so
 you'd need to find a way to make it a 2/2/1 split instead.

 -Michael


 On 28/07/2014 18:41, Robert Fantini wrote:

  OK for higher availability then  5 nodes is better then 3 .  So
 we'll run 5 .  However we want normal operations with just 2
 nodes.   Is that possible?

  Eventually 2 nodes will be next building 10 feet away , with a
 brick wall in between.  Connected with Infiniband or better. So
 one room can go off line the other will be on.   The flip of the
 coin means the 3 node room will probably go down.
  All systems will have dual power supplies connected to different
 UPS'. In addition we have a power generator. Later we'll have a
 2-nd generator. and then  the UPS's will use different lines
 attached to those generators somehow..
 Also of course we never count on one  cluster  to have our data.
 We have 2  co-locations with backup going to often using zfs send
 receive and or rsync .

  So for the 5 node cluster,  how do we set it so 2 nodes up =
 OK ?   Or is that a bad idea?


  PS:  any other idea on how to increase availability are welcome .








   

Re: [ceph-users] anti-cephalopod question

2014-07-29 Thread Robert Fantini
Christian -
 Thank you for the answer. I'll get around to reading 'CRUSH Maps' a few
times; it is important to have a good understanding of Ceph's parts.

 So another question -

 As long as I keep the same number of nodes in both rooms, will  firefly
defaults keep data balanced?


If not I'll stick with 2 in each room until I understand how to configure things.


On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer ch...@gol.com wrote:


 On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:

  target replication level of 3
   with a min of 1 across the node level
 
  After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/
  ,   I assume that to accomplish that then set these in ceph.conf   ?
 
  osd pool default size = 3
  osd pool default min size = 1
 
 Not really, the min size specifies how few replicas need to be online
 for Ceph to accept IO.

 These (the current Firefly defaults) settings with the default crush map
 will have 3 sets of data spread over 3 OSDs and not use the same node
 (host) more than once.
 So with 2 nodes in each location, a replica will always be both locations.
 However if you add more nodes, all of them could wind up in the same
 building.

 To prevent this, you have location qualifiers beyond host and you can
 modify the crush map to enforce that at least one replica is in a
 different rack, row, room, region, etc.

 Advanced material, but one really needs to understand this:
 http://ceph.com/docs/master/rados/operations/crush-map/

 Christian


 
 
 
 
 
 
  On Mon, Jul 28, 2014 at 2:56 PM, Michael mich...@onlinefusion.co.uk
  wrote:
 
If you've two rooms then I'd go for two OSD nodes in each room, a
   target replication level of 3 with a min of 1 across the node level,
   then have 5 monitors and put the last monitor outside of either room
   (The other MON's can share with the OSD nodes if needed). Then you've
   got 'safe' replication for OSD/node replacement on failure with some
   'shuffle' room for when it's needed and either room can be down while
   the external last monitor allows the decisions required to allow a
   single room to operate.
  
   There's no way you can do a 3/2 MON split that doesn't risk the two
   nodes being up and unable to serve data while the three are down so
   you'd need to find a way to make it a 2/2/1 split instead.
  
   -Michael
  
  
   On 28/07/2014 18:41, Robert Fantini wrote:
  
OK for higher availability then  5 nodes is better then 3 .  So we'll
   run 5 .  However we want normal operations with just 2 nodes.   Is that
   possible?
  
Eventually 2 nodes will be next building 10 feet away , with a brick
   wall in between.  Connected with Infiniband or better. So one room can
   go off line the other will be on.   The flip of the coin means the 3
   node room will probably go down.
All systems will have dual power supplies connected to different UPS'.
   In addition we have a power generator. Later we'll have a 2-nd
   generator. and then  the UPS's will use different lines attached to
   those generators somehow..
   Also of course we never count on one  cluster  to have our data.  We
   have 2  co-locations with backup going to often using zfs send receive
   and or rsync .
  
So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or
   is that a bad idea?
  
  
PS:  any other idea on how to increase availability are welcome .
  
  
  
  
  
  
  
  
   On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer ch...@gol.com
   wrote:
  
On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
  
On 07/28/2014 08:49 AM, Christian Balzer wrote:

 Hello,

 On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

 Hello Christian,

 Let me supply more info and answer some questions.

 * Our main concern is high availability, not speed.
 Our storage requirements are not huge.
 However we want good keyboard response 99.99% of the time.   We
 mostly do data entry and reporting.   20-25  users doing mostly
 order , invoice processing and email.

 * DRBD has been very reliable , but I am the SPOF .   Meaning
 that when split brain occurs [ every 18-24 months ] it is me or
 no one who knows what to do. Try to explain how to deal with
 split brain in advance For the future ceph looks like it
 will be easier to maintain.

 The DRBD people would of course tell you to configure things in a
 way that a split brain can't happen. ^o^

 Note that given the right circumstances (too many OSDs down, MONs
   down)
 Ceph can wind up in a similar state.
   
   
I am not sure what you mean by ceph winding up in a similar state.
If you mean regarding 'split brain' in the usual sense of the term,
it does not occur in Ceph.  If it does, you have surely found a bug
and you should let us know with lots of CAPS.
   
What you can incur though if you have too many monitors 

Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Christian Balzer

Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

 Hello Christian,
 
 Let me supply more info and answer some questions.
 
 * Our main concern is high availability, not speed.
 Our storage requirements are not huge.
 However we want good keyboard response 99.99% of the time.   We mostly do
 data entry and reporting.   20-25  users doing mostly order , invoice
 processing and email.
 
 * DRBD has been very reliable , but I am the SPOF .   Meaning that when
 split brain occurs [ every 18-24 months ] it is me or no one who knows
 what to do. Try to explain how to deal with split brain in advance
 For the future ceph looks like it will be easier to maintain.
 
The DRBD people would of course tell you to configure things in a way that
a split brain can't happen. ^o^

Note that given the right circumstances (too many OSDs down, MONs down)
Ceph can wind up in a similar state.

 * We use Proxmox . So ceph and mons will share each node. I've used
 proxmox for a few years and like the kvm / openvz management.
 
I tried it some time ago, but at that time it was still stuck with 2.6.32
due to OpenVZ and that wasn't acceptable to me for various reasons. 
I think it still is, too.

 * Ceph hardware:
 
 Four  hosts .  8 drives each.
 
 OPSYS: raid-1  on ssd .
 
Good, that should be sufficient for running MONs (you will want 3).

 OSD: four disk raid 10 array using  2-TB drives.
 
 Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM
 128MB Cache SAS 6Gb/s
 
 the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB
 Cache SATA 6.0Gb/s   drives.
 
 Journal: 200GB Intel DC S3700 Series
 
 Spare disk for raid.
 
 * more questions.
 you wrote:
 In essence, if your current setup can't handle the loss of a single
 disk, what happens if a node fails?
 You will need to design (HW) and configure (various Ceph options) your
 cluster to handle these things because at some point a recovery might be
 unavoidable.
 
 To prevent recoveries based on failed disks, use RAID, for node failures
 you could permanently set OSD noout or have a monitoring software do that
 when it detects a node failure.
 
 I'll research  'OSD noout' .

You might also be happy with mon osd downout subtree limit set to host.
In that case you will need to manually trigger a rebuild (set that
node/OSD to out) if you can't repair a failed node in a short time, in
order to keep your redundancy levels.
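
A small sketch of what that might look like (the OSD ids are made up):

    # ceph.conf
    mon osd downout subtree limit = host   # don't auto-out a whole down host

    # if the failed node can't be repaired quickly, trigger the rebuild by
    # hand, once per OSD on that node:
    ceph osd out 12
    ceph osd out 13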
 
 Are there other setting I should read up on / consider?
 
 For node reboots due to kernel upgrades -  how is that handled?   Of
 course that would be scheduled for off hours.
 
Set noout before a planned downtime, or live dangerously and assume the
node comes back within the timeout period (5 minutes IIRC).
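
For a planned reboot that would be something like (a sketch; the 5-minute
window corresponds to mon osd down out interval, 300 seconds by default):

    ceph osd set noout     # before taking the node down
    # ... reboot the node for the kernel upgrade ...
    ceph osd unset noout   # once its OSDs are back up and in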

 Any other suggestions?
 
Test your cluster extensively before going into production.

Fill it with enough data to be close to what you're expecting and fail one
node/OSD. 

See how bad things become, try to determine where any bottlenecks are with
tools like atop.
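
One way to do that (a sketch; the pool name and duration are arbitrary):

    # sustained write load against a test pool
    rados bench -p rbd 300 write

    # in other terminals, watch the cluster and the nodes
    ceph -w
    atop

    # then stop the OSDs on one node (or pull its power) and watch how
    # client I/O behaves during peering and recovery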

While you've done pretty much everything to prevent that scenario, both
from a disk failure (with the RAID10) and by keeping nodes from being set
out by whatever means you choose (mon osd downout subtree limit = host
seems to work, I just tested it), having a cluster that doesn't melt down
when recovering, or at least knowing how bad things will be in such a
scenario, helps a lot.

Regards,

Christian

 thanks for the suggestions,
 Rob
 
 
 On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer ch...@gol.com wrote:
 
 
  Hello,
 
  actually replying in the other thread was fine by me, it was after
  relevant in a sense to it.
  And you mentioned something important there, which you didn't mention
  below, that you're coming from DRBD with a lot of experience there.
 
  So do I and Ceph/RBD simply isn't (and probably never will be) an
  adequate replacement for DRBD in some use cases.
  I certainly plan to keep deploying DRBD where it makes more sense
  (IOPS/speed), while migrating everything else to Ceph.
 
  Anyway, lets look at your mail:
 
  On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
 
   I've a question regarding advice from these threads:
  
  https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
  
   https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
  
  
  
Our current setup has 4 osd's per node.When a drive  fails   the
   cluster is almost unusable for data entry.   I want to change our
   set up so that under no circumstances ever happens.
  
 
  While you can pretty much avoid this from happening, your cluster
  should be able to handle a recovery.
  While Ceph is a bit more hamfisted than DRBD and definitely needs more
  controls and tuning to make recoveries have less of an impact you would
  see something similar with DRBD and badly configured recovery speeds.
 
  In essence, if your current setup can't handle the loss of a single
  disk, what happens if a node fails?
  You 

Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Michael
If you've two rooms then I'd go for two OSD nodes in each room, a target
replication level of 3 with a min of 1 across the node level, then have
5 monitors and put the last monitor outside of either room (the other
MONs can share with the OSD nodes if needed). Then you've got 'safe'
replication for OSD/node replacement on failure, with some 'shuffle' room
for when it's needed, and either room can be down while the external last
monitor allows the decisions required for a single room to operate.


There's no way you can do a 3/2 MON split that doesn't risk the two
nodes being up and unable to serve data while the three are down, so
you'd need to find a way to make it a 2/2/1 split instead.
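
As a sketch of that 2/2/1 layout (names are illustrative), a quorum of 5
MONs needs (5/2)+1 = 3:

    room A:    mon.a  mon.b   (on OSD nodes 1 and 2)
    room B:    mon.c  mon.d   (on OSD nodes 3 and 4)
    outside:   mon.e

    room A down   -> c, d, e still up = 3 of 5, quorum holds
    room B down   -> a, b, e still up = 3 of 5, quorum holds
    outside down  -> 4 of 5, quorum holds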


-Michael

On 28/07/2014 18:41, Robert Fantini wrote:
OK for higher availability then  5 nodes is better then 3 .  So we'll 
run 5 .  However we want normal operations with just 2 nodes.   Is 
that possible?


Eventually 2 nodes will be next building 10 feet away , with a brick 
wall in between.  Connected with Infiniband or better. So one room can 
go off line the other will be on.   The flip of the coin means the 3 
node room will probably go down.
 All systems will have dual power supplies connected to different 
UPS'.   In addition we have a power generator. Later we'll have a 2-nd 
generator. and then  the UPS's will use different lines attached to 
those generators somehow..
Also of course we never count on one  cluster  to have our data.  We 
have 2  co-locations with backup going to often using zfs send receive 
and or rsync .


So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or 
is that a bad idea?



PS:  any other idea on how to increase availability are welcome .








On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer ch...@gol.com 
mailto:ch...@gol.com wrote:


On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:

 On 07/28/2014 08:49 AM, Christian Balzer wrote:
 
  Hello,
 
  On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
 
  Hello Christian,
 
  Let me supply more info and answer some questions.
 
  * Our main concern is high availability, not speed.
  Our storage requirements are not huge.
  However we want good keyboard response 99.99% of the time.   We
  mostly do data entry and reporting. 20-25  users doing mostly
  order , invoice processing and email.
 
  * DRBD has been very reliable , but I am the SPOF .   Meaning
that
  when split brain occurs [ every 18-24 months ] it is me or no
one who
  knows what to do. Try to explain how to deal with split brain in
  advance For the future ceph looks like it will be easier to
  maintain.
 
  The DRBD people would of course tell you to configure things
in a way
  that a split brain can't happen. ^o^
 
  Note that given the right circumstances (too many OSDs down,
MONs down)
  Ceph can wind up in a similar state.


 I am not sure what you mean by ceph winding up in a similar
state.  If
 you mean regarding 'split brain' in the usual sense of the term,
it does
 not occur in Ceph.  If it does, you have surely found a bug and you
 should let us know with lots of CAPS.

 What you can incur though if you have too many monitors down is
cluster
 downtime.  The monitors will ensure you need a strict majority of
 monitors up in order to operate the cluster, and will not serve
requests
 if said majority is not in place.  The monitors will only serve
requests
 when there's a formed 'quorum', and a quorum is only formed by
(N/2)+1
 monitors, N being the total number of monitors in the cluster
(via the
 monitor map -- monmap).

 This said, if out of 3 monitors you have 2 monitors down, your
cluster
 will cease functioning (no admin commands, no writes or reads
served).
 As there is no configuration in which you can have two strict
 majorities, thus no two partitions of the cluster are able to
function
 at the same time, you do not incur in split brain.

I wrote similar state, not same state.

From a user perspective it is purely semantics how and why your shared
storage has seized up, the end result is the same.

And yes, that MON example was exactly what I was aiming for, your
cluster
might still have all the data (another potential failure mode of
cause),
but is inaccessible.

DRBD will see and call it a split brain, Ceph will call it a Paxos
voting
failure, it doesn't matter one iota to the poor sod relying on that
particular storage.

My point was and is, when you design a cluster of whatever flavor,
make
sure you understand how it can (and WILL) fail, how to prevent
that from
happening if at all possible and how to recover from it if not.

Potentially (hopefully) in the case of Ceph it would be 

Re: [ceph-users] anti-cephalopod question

2014-07-27 Thread Robert Fantini
Hello Christian,

Let me supply more info and answer some questions.

* Our main concern is high availability, not speed.
Our storage requirements are not huge.
However we want good keyboard response 99.99% of the time. We mostly do
data entry and reporting: 20-25 users doing mostly order and invoice
processing and email.

* DRBD has been very reliable, but I am the SPOF. Meaning that when
split brain occurs [every 18-24 months] it is me or no one who knows what
to do. Try to explain how to deal with split brain in advance... For the
future, Ceph looks like it will be easier to maintain.

* We use Proxmox, so Ceph and the MONs will share each node. I've used
Proxmox for a few years and like the KVM / OpenVZ management.

* Ceph hardware:

Four  hosts .  8 drives each.

OPSYS: raid-1  on ssd .

OSD: four disk raid 10 array using  2-TB drives.

Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM 128MB
Cache SAS 6Gb/s

the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB
Cache SATA 6.0Gb/s   drives.

Journal: 200GB Intel DC S3700 Series

Spare disk for raid.

* more questions.
you wrote:
In essence, if your current setup can't handle the loss of a single disk,
what happens if a node fails?
You will need to design (HW) and configure (various Ceph options) your
cluster to handle these things because at some point a recovery might be
unavoidable.

To prevent recoveries based on failed disks, use RAID, for node failures
you could permanently set OSD noout or have a monitoring software do that
when it detects a node failure.

I'll research  'OSD noout' .

Are there other setting I should read up on / consider?

For node reboots due to kernel upgrades -  how is that handled?   Of course
that would be scheduled for off hours.

Any other suggestions?

thanks for the suggestions,
Rob


On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer ch...@gol.com wrote:


 Hello,

 actually replying in the other thread was fine by me, it was after
 relevant in a sense to it.
 And you mentioned something important there, which you didn't mention
 below, that you're coming from DRBD with a lot of experience there.

 So do I and Ceph/RBD simply isn't (and probably never will be) an adequate
 replacement for DRBD in some use cases.
 I certainly plan to keep deploying DRBD where it makes more sense
 (IOPS/speed), while migrating everything else to Ceph.

 Anyway, lets look at your mail:

 On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

  I've a question regarding advice from these threads:
 
 https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
 
  https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
 
 
 
   Our current setup has 4 osd's per node.When a drive  fails   the
  cluster is almost unusable for data entry.   I want to change our set up
  so that under no circumstances ever happens.
 

 While you can pretty much avoid this from happening, your cluster should
 be able to handle a recovery.
 While Ceph is a bit more hamfisted than DRBD and definitely needs more
 controls and tuning to make recoveries have less of an impact you would
 see something similar with DRBD and badly configured recovery speeds.

 In essence, if your current setup can't handle the loss of a single disk,
 what happens if a node fails?
 You will need to design (HW) and configure (various Ceph options) your
 cluster to handle these things because at some point a recovery might be
 unavoidable.

 To prevent recoveries based on failed disks, use RAID, for node failures
 you could permanently set OSD noout or have a monitoring software do that
 when it detects a node failure.

   Network:  we use 2 IB switches and  bonding in fail over mode.
   Systems are two  Dell Poweredge r720 and Supermicro X8DT3 .
 

 I'm confused. Those Dells tend to have 8 drive bays normally, don't they?
 So you're just using 4 HDDs for OSDs? No SSD journals?
 Just 2 storage nodes?
 Note that unless you do use RAIDed OSDs this leaves you vulnerable to dual
 disk failures. Which will happen.

 Also that SM product number is for a motherboard, not a server, is that
 your monitor host?
 Anything production with data on in that you value should have 3 mon
 hosts, if you can't afford dedicated ones sharing them on an OSD node
 (preferably with the OS on SSDs to keep leveldb happy) is better than just
 one, because if that one dies or gets corrupted, your data is inaccessible.

   So looking at how to do things better we will try  '#4- anti-cephalopod'
  .   That is a seriously funny phrase!
 
  We'll switch to using raid-10 or raid-6 and have one osd per node, using
  high end raid controllers,  hot spares etc.
 
 Are you still talking about the same hardware as above, just 4 HDDs for
 storage?
 With 4 HDDs I'd go for RAID10 (definitely want a hotspare there), if you
 have more bays use up to 12 for RAID6 with a high performance and large
 HW cache controller.

  And use one Intel 200gb 

Re: [ceph-users] anti-cephalopod question

2014-07-25 Thread Christian Balzer

Hello,

actually replying in the other thread was fine by me, it was after all
relevant in a sense to it.
And you mentioned something important there, which you didn't mention
below, that you're coming from DRBD with a lot of experience there.

So do I and Ceph/RBD simply isn't (and probably never will be) an adequate
replacement for DRBD in some use cases. 
I certainly plan to keep deploying DRBD where it makes more sense
(IOPS/speed), while migrating everything else to Ceph.

Anyway, let's look at your mail:

On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

 I've a question regarding advice from these threads:
 https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
 
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
 
 
 
  Our current setup has 4 osd's per node.When a drive  fails   the
 cluster is almost unusable for data entry.   I want to change our set up
 so that under no circumstances ever happens.
 

While you can pretty much avoid this from happening, your cluster should
be able to handle a recovery.
While Ceph is a bit more hamfisted than DRBD and definitely needs more
controls and tuning to make recoveries have less of an impact, you would
see something similar with DRBD and badly configured recovery speeds.

In essence, if your current setup can't handle the loss of a single disk,
what happens if a node fails?
You will need to design (HW) and configure (various Ceph options) your
cluster to handle these things because at some point a recovery might be
unavoidable. 

To prevent recoveries based on failed disks, use RAID, for node failures
you could permanently set OSD noout or have a monitoring software do that
when it detects a node failure.

  Network:  we use 2 IB switches and  bonding in fail over mode.
  Systems are two  Dell Poweredge r720 and Supermicro X8DT3 .
 

I'm confused. Those Dells tend to have 8 drive bays normally, don't they?
So you're just using 4 HDDs for OSDs? No SSD journals?
Just 2 storage nodes? 
Note that unless you do use RAIDed OSDs this leaves you vulnerable to dual
disk failures. Which will happen. 

Also that SM product number is for a motherboard, not a server, is that
your monitor host?
Anything production with data on it that you value should have 3 mon
hosts; if you can't afford dedicated ones, sharing them with an OSD node
(preferably with the OS on SSDs to keep leveldb happy) is better than just
one, because if that one dies or gets corrupted, your data is inaccessible.

  So looking at how to do things better we will try  '#4- anti-cephalopod'
 .   That is a seriously funny phrase!
 
 We'll switch to using raid-10 or raid-6 and have one osd per node, using
 high end raid controllers,  hot spares etc.
 
Are you still talking about the same hardware as above, just 4 HDDs for
storage? 
With 4 HDDs I'd go for RAID10 (definitely want a hotspare there), if you
have more bays use up to 12 for RAID6 with a high performance and large
HW cache controller.  

 And use one Intel 200gb S3700 per node for journal
 
That's barely enough for 4 HDDs (the 200GB DC S3700 writes at about
365MB/s), but will do nicely if those are in a RAID10, which writes at
half the aggregate speed of the individual drives.
Keep in mind that your node will never be able to write faster than the
speed of your journal.
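
Rough numbers, assuming something like 150 MB/s sustained sequential
writes per 7200 RPM disk (an assumption, not a measurement):

    4 independent OSD disks:   4 x 150 MB/s = ~600 MB/s  >  365 MB/s journal
    4-disk RAID10 (2 mirrors): 2 x 150 MB/s = ~300 MB/s  <  365 MB/s journal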

 My questions:
 
 is there a minimum number of OSD's which should be used?
 
If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2 nodes
is sufficient to begin with. 
However your performance might not be what you expect (an OSD process
seems to be incapable of doing more than 800 write IOPS). 
But with a 4 disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's
not so much of an issue. 
In my case with a 11 disk RAID6 AND the 4GB HW cache Areca controller it
certainly is rather frustrating.

In short, the more nodes (OSDs) you can deploy, the better the
performance will be. And of course in case a node dies and you don't
think it can be brought back in a sensible short time frame, having more
than 2 nodes will enable you to do a recovery/rebalance and restore your
redundancy to the desired level. 

 should  OSD's per node be the same?
 
It is advantageous to have identical disks and OSD sizes; it makes the
whole thing more predictable and you don't have to play with weights.

As for having different number of OSDs per node, consider this example:

4 nodes with 1 OSD, one node with 4 OSDs (all OSDs are of the same size).
What will happen here is that all the replicas from single OSD nodes might
wind up on the 4 OSD node. So it better have more power in all aspects
than the single OSD nodes.
Now that node fails and you decide to let things rebalance as it can't be
repaired shortly. But your cluster was half full, and since that node held
half of the total capacity, it will now be 100% full and become unusable
(for writes).

So the moral of the story, deploy as much identical HW as possible. 

Christian

 best regards, Rob
 
 
 PS:  I had asked above in middle of another thread...  please ignore
 there.


--