Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2016-03-10 Thread Christian Balzer

Hello,

On Thu, 10 Mar 2016 22:25:10 -0500 Alex Gorbachev wrote:

> Reviving an old thread:
> 
> On Sunday, July 12, 2015, Lionel Bouton  wrote:
> 
> > On 07/12/15 05:55, Alex Gorbachev wrote:
> > > FWIW. Based on the excellent research by Mark Nelson
> > > (
> > http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
> > )
> > > we have dropped SSD journals altogether, and instead went for the
> > > battery protected controller writeback cache.
> >
> > Note that this has limitations (and the research is nearly 2 years
> > old):
> > - the controller writeback caches are relatively small (often less than
> > 4GB, 2GB is common on the controller, a small portion is not usable,
> > and 10% of the rest is often used for readahead/read cache) and this is
> > shared by all of your drives. If your workload is not "write spikes"
> > oriented, but nearly constant writes this won't help as you will be
> > limited on each OSD by roughly half of the disk IOPS. With journals on
> > SSDs when you hit their limit (which is ~5GB of buffer for 10GB
> > journals and not <2GB divided by the amount of OSDs per controller),
> > the limit is the raw disk IOPS.
> > - you *must* make sure the controller is configured to switch to
> > write-through when the battery/capacitor fails (or a power failure on
> > hardware from the same generation could make you lose all of the OSDs
> > connected to them in a single event which means data loss),
> > - you should monitor the battery/capacitor status to trigger
> > maintenance (and your cluster will slow down while the
> > battery/capacitor is waiting for a replacement, you might want to down
> > the associated OSDs depending on your cluster configuration). We
> > mostly eliminated this problem by replacing the whole chassis of the
> > servers we lease for new generations every 2 or 3 years: if you time
> > the hardware replacement to match a fresh chassis generation this
> > means fresh capacitors and they shouldn't fail you (ours are rated for
> > 3 years).
> >
> > We just ordered Intel S3710 SSDs even though we have battery/capacitor
> > backed caches on the controllers: the latencies have started to rise
> > nevertheless when there are long periods of write intensive activity.
> > I'm currently pondering if we should bypass the write-cache for the
> > SSDs. The cache is obviously less effective on them and might be more
> > useful overall if it is dedicated to the rotating disks. Does anyone
> > have test results with cache active/inactive on SSD journals with HP
> > Smart Array p420 or p840 controllers?
> 
> 
> We have come to the same conclusion once we started seeing some more
> constant write loads. Thank you for the great info - question: have you
> tried SSD journals with and without additional controller cache?  Any
> benefit?
>
Haven't tried that with journal SSDs, simply because I tend to use DC
S3700s there, which would benefit little considering the cost of a fast
enough controller with ample cache.

That said, I've done this both with HDDs using on-disk journals (with the
expected results, as detailed above) and with consumer Intel 530 SSDs on
some Twin servers that came with LSI 2108 controllers.

In the latter case these are OS disks, nothing Ceph related.
But the HW controller cache nicely masks the garbage collection spikes and
the slowness of SYNC writes of these SSDs in medium-load scenarios.

In short, HW cache should always help, but it can do only so much (for so
long), so unless you already have HW with it or can get it dirt cheap, it's
not particularly economical once you reach its limits.
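
To put rough numbers on the "only so much, for so long" point, here is a
minimal back-of-the-envelope sketch in Python. All figures (16 KiB writes,
~150 random-write IOPS per 7.2k SATA disk, a 4x overload burst) are
assumptions for illustration, not measurements from any cluster discussed
here:

# Compare how long a shared 2GB BBWC vs. a per-OSD 10GB SSD journal
# (~5GB usable, as discussed above) can absorb a small-write burst before
# the raw HDD IOPS become the bottleneck again.

def seconds_until_full(buffer_bytes, write_bytes, ingress_iops, drain_iops):
    """Time until a write-back buffer fills when ingress exceeds drain."""
    net_iops = ingress_iops - drain_iops
    if net_iops <= 0:
        return float("inf")  # drain keeps up, buffer never fills
    return buffer_bytes / (net_iops * write_bytes)

write_bytes = 16 * 1024   # assumed 16 KiB small writes
hdd_iops = 150            # rough random-write IOPS of a 7.2k SATA disk
osds = 12                 # OSDs behind one controller

# Shared BBWC: ~2GB minus the read-ahead share, draining at half the raw
# HDD IOPS (journal + data share each spindle), hit with a 4x overload.
bbwc_bytes = int(0.9 * 2 * 1024**3)
drain = osds * hdd_iops / 2
print("BBWC lasts %.0f s" %
      seconds_until_full(bbwc_bytes, write_bytes, 4 * drain, drain))

# Per-OSD SSD journal: ~5GB of buffer, draining at the full HDD IOPS since
# the journal is off the spindle, hit with the same 4x overload per OSD.
journal_bytes = 5 * 1024**3
print("SSD journal lasts %.0f s per OSD"
      % seconds_until_full(journal_bytes, write_bytes, 4 * hdd_iops, hdd_iops))

With those assumptions the shared cache buys well under a minute of overload
per controller, while each per-OSD journal buys on the order of ten minutes;
after that, both cases degrade to raw spindle IOPS.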

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2016-03-10 Thread Alex Gorbachev
Reviving an old thread:

On Sunday, July 12, 2015, Lionel Bouton  wrote:

> On 07/12/15 05:55, Alex Gorbachev wrote:
> > FWIW. Based on the excellent research by Mark Nelson
> > (
> http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
> )
> > we have dropped SSD journals altogether, and instead went for the
> > battery protected controller writeback cache.
>
> Note that this has limitations (and the research is nearly 2 years old):
> - the controller writeback caches are relatively small (often less than
> 4GB, 2GB is common on the controller, a small portion is not usable, and
> 10% of the rest is often used for readahead/read cache) and this is
> shared by all of your drives. If your workload is not "write spikes"
> oriented, but nearly constant writes this won't help as you will be
> limited on each OSD by roughly half of the disk IOPS. With journals on
> SSDs when you hit their limit (which is ~5GB of buffer for 10GB journals
> and not <2GB divided by the amount of OSDs per controller), the limit is
> the raw disk IOPS.
> - you *must* make sure the controller is configured to switch to
> write-through when the battery/capacitor fails (or a power failure on
> hardware from the same generation could make you lose all of the OSDs
> connected to them in a single event which means data loss),
> - you should monitor the battery/capacitor status to trigger maintenance
> (and your cluster will slow down while the battery/capacitor is waiting
> for a replacement, you might want to down the associated OSDs depending
> on your cluster configuration). We mostly eliminated this problem by
> replacing the whole chassis of the servers we lease for new generations
> every 2 or 3 years: if you time the hardware replacement to match a
> fresh chassis generation this means fresh capacitors and they shouldn't
> fail you (ours are rated for 3 years).
>
> We just ordered Intel S3710 SSDs even though we have battery/capacitor
> backed caches on the controllers: the latencies have started to rise
> nevertheless when there are long periods of write intensive activity.
> I'm currently pondering if we should bypass the write-cache for the
> SSDs. The cache is obviously less effective on them and might be more
> useful overall if it is dedicated to the rotating disks. Does anyone
> have test results with cache active/inactive on SSD journals with HP
> Smart Array p420 or p840 controllers?


We have come to the same conclusion once we started seeing some more
constant write loads. Thank you for the great info - question: have you
tried SSD journals with and without additional controller cache?  Any
benefit?

Thank you,
Alex



>
> Lionel
>


-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-12 Thread Christian Balzer

Hello,

thanks to Lionel for writing pretty much what I was going to, in
particular cache sizes and read-ahead cache allocations. 

In addition to this, keep in mind that all writes still have to happen
twice per drive: once for the journal and once for the actual OSD data.
So when the cache is too busy to merge writes nicely, your HDD IOPS are
effectively halved again.
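
As a crude illustration of that double write (all numbers assumed, caching
and write coalescing ignored):

def effective_client_write_iops(osds, hdd_iops, size, journal_on_hdd=True):
    """Upper bound on aggregate client write IOPS for a FileStore cluster."""
    # Each client write is replicated `size` times; with co-located journals
    # every replica costs two disk writes (journal + data) on the spindle.
    disk_ops_per_client_write = size * (2 if journal_on_hdd else 1)
    return osds * hdd_iops / disk_ops_per_client_write

osds, hdd_iops, size = 72, 150, 2   # e.g. 6 nodes x 12 OSDs, 7.2k SATA, size=2
print("journals on HDDs:", effective_client_write_iops(osds, hdd_iops, size, True))
print("journals on SSDs:", effective_client_write_iops(osds, hdd_iops, size, False))

With size=2 and co-located journals, 72 spinners top out around 2,700 small
client writes per second in this model; moving the journals off the spindles
roughly doubles that.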

Christian

On Sun, 12 Jul 2015 14:33:03 +0200 Lionel Bouton wrote:

 On 07/12/15 05:55, Alex Gorbachev wrote:
  FWIW. Based on the excellent research by Mark Nelson
  (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/)
  we have dropped SSD journals altogether, and instead went for the
  battery protected controller writeback cache.
 
 Note that this has limitations (and the research is nearly 2 years old):
 - the controller writeback caches are relatively small (often less than
 4GB, 2GB is common on the controller, a small portion is not usable, and
 10% of the rest is often used for readahead/read cache) and this is
 shared by all of your drives. If your workload is not "write spikes"
 oriented, but nearly constant writes this won't help as you will be
 limited on each OSD by roughly half of the disk IOPS. With journals on
 SSDs when you hit their limit (which is ~5GB of buffer for 10GB journals
 and not <2GB divided by the amount of OSDs per controller), the limit is
 the raw disk IOPS.
 - you *must* make sure the controller is configured to switch to
 write-through when the battery/capacitor fails (or a power failure on
 hardware from the same generation could make you lose all of the OSDs
 connected to them in a single event which means data loss),
 - you should monitor the battery/capacitor status to trigger maintenance
 (and your cluster will slow down while the battery/capacitor is waiting
 for a replacement, you might want to down the associated OSDs depending
 on your cluster configuration). We mostly eliminated this problem by
 replacing the whole chassis of the servers we lease for new generations
 every 2 or 3 years: if you time the hardware replacement to match a
 fresh chassis generation this means fresh capacitors and they shouldn't
 fail you (ours are rated for 3 years).
 
 We just ordered Intel S3710 SSDs even though we have battery/capacitor
 backed caches on the controllers: the latencies have started to rise
 nevertheless when there are long periods of write intensive activity.
 I'm currently pondering if we should bypass the write-cache for the
 SSDs. The cache is obviously less effective on them and might be more
 useful overall if it is dedicated to the rotating disks. Does anyone
 have test results with cache active/inactive on SSD journals with HP
 Smart Array p420 or p840 controllers?
 
 Lionel
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-12 Thread Lionel Bouton
On 07/12/15 05:55, Alex Gorbachev wrote:
 FWIW. Based on the excellent research by Mark Nelson
 (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/)
 we have dropped SSD journals altogether, and instead went for the
 battery protected controller writeback cache.

Note that this has limitations (and the research is nearly 2 years old):
- the controller writeback caches are relatively small (often less than
4GB, 2GB is common on the controller, a small portion is not usable, and
10% of the rest is often used for readahead/read cache) and this is
shared by all of your drives. If your workload is not "write spikes"
oriented, but nearly constant writes this won't help as you will be
limited on each OSD by roughly half of the disk IOPS. With journals on
SSDs when you hit their limit (which is ~5GB of buffer for 10GB journals
and not <2GB divided by the amount of OSDs per controller), the limit is
the raw disk IOPS.
- you *must* make sure the controller is configured to switch to
write-through when the battery/capacitor fails (or a power failure on
hardware from the same generation could make you lose all of the OSDs
connected to them in a single event which means data loss),
- you should monitor the battery/capacitor status to trigger maintenance
(and your cluster will slow down while the battery/capacitor is waiting
for a replacement, you might want to down the associated OSDs depending
on your cluster configuration). We mostly eliminated this problem by
replacing the whole chassis of the servers we lease for new generations
every 2 or 3 years: if you time the hardware replacement to match a
fresh chassis generation this means fresh capacitors and they shouldn't
fail you (ours are rated for 3 years).

We just ordered Intel S3710 SSDs even though we have battery/capacitor
backed caches on the controllers: the latencies have started to rise
nevertheless when there are long periods of write intensive activity.
I'm currently pondering if we should bypass the write-cache for the
SSDs. The cache is obviously less effective on them and might be more
useful overall if it is dedicated to the rotating disks. Does anyone
have test results with cache active/inactive on SSD journals with HP
Smart Array p420 or p840 controllers?
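
On the battery/capacitor monitoring point above, a minimal check sketch. It
assumes HP's ssacli/hpssacli tool and matches on the "Battery/Capacitor
Status" line; the exact wording varies by controller and firmware, so treat
it as a starting point rather than a finished check:

#!/usr/bin/env python3
# Minimal Nagios-style check for the controller cache battery/capacitor.
import subprocess
import sys

def bbu_ok():
    out = subprocess.run(["ssacli", "ctrl", "all", "show", "status"],
                         capture_output=True, text=True, check=True).stdout
    status = [line.strip() for line in out.splitlines()
              if "Battery/Capacitor Status" in line]
    return status and all(line.endswith("OK") for line in status)

if __name__ == "__main__":
    if bbu_ok():
        print("OK - controller battery/capacitor healthy")
        sys.exit(0)
    # A failed/missing BBU usually means the controller has dropped to
    # write-through: alert so the affected OSDs can be weighted down.
    print("CRITICAL - battery/capacitor degraded or status line missing")
    sys.exit(2)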

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-11 Thread Alex Gorbachev
FWIW. Based on the excellent research by Mark Nelson (
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/)
we have dropped SSD journals altogether, and instead went for the battery
protected controller writeback cache.

Benefits:

- No negative force multiplier with one SSD failure taking down multiple
OSDs

- OSD portability: move OSD drives across nodes

- OSD recovery: stick them into a surviving OSD node and they keep working

I agree on size=3; it seems to be the safest in all situations.

Regards,
Alex

On Thu, Jul 9, 2015 at 6:38 PM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 So, I was running with size=2, until we had a network interface on an
 OSD node go faulty, and start corrupting data. Because ceph couldn't tell
 which copy was right it caused all sorts of trouble. I might have been able
 to recover more gracefully had I caught the problem sooner and been able to
 identify the root right away, but as it was, we ended up labeling every VM
 in the cluster suspect destroying the whole thing and restoring from
 backups. I didn't end up managing to find the root of the problem until I
 was rebuilding the cluster and noticed one node felt weird when I was
 ssh'd into it. It was painful.

 We are currently running important vms from a ceph pool with size=3, and
 more disposable ones from a size=2 pool, and that seems to be a reasonable
 tradeoff so far, giving us a bit more IO overhead than we would have
 running 3 for everything, but still having safety where we need it.

 QH

 On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke 
 goetz.reini...@filmakademie.de wrote:

 Hi Warren,

 thanks for that feedback. regarding the 2 or 3 copies we had a lot of
 internal discussions and lots of pros and cons on 2 and 3 :) … and finally
 decided to give 2 copies in the first - now called evaluation cluster - a
 chance to prove.

 I bet in 2016 we will see, if that was a good decision or bad and data
 los is in that scenario ok. We evaluate. :)

 Regarding one P3700 for 12 SATA disks I do get it right, that if that
 P3700 fails all 12 OSDs are lost… ? So that looks like a bigger risk to me
 from my current knowledge. Or are the P3700 so much more reliable than the
 eg. S3500 or S3700?

 Or is the suggestion with the P3700 if we go in the direction of 20+
 nodes and till than stay without SSDs for journaling.

 I really appreciate your thoughts and feedback and I’m aware of the fact
 that building a ceph cluster is some sort of knowing the specs,
 configuration option, math, experience, modification and feedback from best
 practices real world clusters. Finally all clusters are unique in some way
 and what works for one will not work for an other.

 Thanks for feedback, 100 kowtows . Götz



  Am 09.07.2015 um 16:58 schrieb Wang, Warren 
 warren_w...@cable.comcast.com:
 
  You'll take a noticeable hit on write latency. Whether or not it's
 tolerable will be up to you and the workload you have to capture. Large
 file operations are throughput efficient without an SSD journal, as long as
 you have enough spindles.
 
  About the Intel P3700, you will only need 1 to keep up with 12 SATA
 drives. The 400 GB is probably okay if you keep the journal sizes small,
 but the 800 is probably safer if you plan on leaving these in production
 for a few years. Depends on the turnover of data on the servers.
 
  The dual disk failure comment is pointing out that you are more exposed
 for data loss with 2 copies. You do need to understand that there is a
 possibility for 2 drives to fail either simultaneously, or one before the
 cluster is repaired. As usual, this is going to be a decision you need to
 decide if it's acceptable or not. We have many clusters, and some are 2,
 and others are 3. If your data resides nowhere else, then 3 copies is the
 safe thing to do. That's getting harder and harder to justify though, when
 the price of other storage solutions using erasure coding continues to
 plummet.
 
  Warren
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Götz Reinicke - IT Koordinator
  Sent: Thursday, July 09, 2015 4:47 AM
  To: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Real world benefit from SSD Journals for a
 more read than write cluster
 
  Hi Christian,
  Am 09.07.15 um 09:36 schrieb Christian Balzer:
 
  Hello,
 
  On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator
 wrote:
 
  Hi again,
 
  time is passing, so is my budget :-/ and I have to recheck the
  options for a starter cluster. An expansion next year for may be an
  openstack installation or more performance if the demands rise is
  possible. The starter could always be used as test or slow dark
 archive.
 
  At the beginning I was at 16SATA OSDs with 4 SSDs for journal per
  node, but now I'm looking for 12 SATA OSDs without SSD journal. Less
  performance, less capacity I know. But thats ok!
 
  Leave the space to upgrade

Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Wang, Warren
You'll take a noticeable hit on write latency. Whether or not it's tolerable 
will be up to you and the workload you have to capture. Large file operations 
are throughput efficient without an SSD journal, as long as you have enough 
spindles.

About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. The 
400 GB is probably okay if you keep the journal sizes small, but the 800 is 
probably safer if you plan on leaving these in production for a few years. 
Depends on the turnover of data on the servers.
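
For reference, the usual FileStore sizing rule behind "keep the journal
sizes small" fits in a few lines; the throughput figure below is an
assumption for a single 7.2k SATA OSD:

# Rule of thumb: journal size >= 2 x expected throughput x
# "filestore max sync interval" (default 5 s).

def journal_size_gb(throughput_mb_s, sync_interval_s=5, factor=2):
    return factor * throughput_mb_s * sync_interval_s / 1024.0

osds = 12
per_osd_mb_s = 110                       # assumed sequential write ceiling per spinner
per_osd_gb = journal_size_gb(per_osd_mb_s)
print("per-OSD journal: %.1f GB" % per_osd_gb)                              # ~1.1 GB
print("total on NVMe  : %.1f GB for %d OSDs" % (per_osd_gb * osds, osds))   # ~13 GB
# Keeping journals this small leaves the bulk of a 400GB device as spare
# area, which helps endurance over a multi-year deployment.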

The dual disk failure comment is pointing out that you are more exposed for 
data loss with 2 copies. You do need to understand that there is a possibility 
for 2 drives to fail either simultaneously, or one before the cluster is 
repaired. As usual, this is a decision you need to make about whether it's 
acceptable or not. We have many clusters, and some are 2, and others are 3. If 
your data resides nowhere else, then 3 copies is the safe thing to do. That's 
getting harder and harder to justify though, when the price of other storage 
solutions using erasure coding continues to plummet.

Warren

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz 
Reinicke - IT Koordinator
Sent: Thursday, July 09, 2015 4:47 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more read 
than write cluster

Hi Christian,
Am 09.07.15 um 09:36 schrieb Christian Balzer:
 
 Hello,
 
 On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
 
 Hi again,

 time is passing, so is my budget :-/ and I have to recheck the 
 options for a starter cluster. An expansion next year for may be an 
 openstack installation or more performance if the demands rise is 
 possible. The starter could always be used as test or slow dark archive.

 At the beginning I was at 16SATA OSDs with 4 SSDs for journal per 
 node, but now I'm looking for 12 SATA OSDs without SSD journal. Less 
 performance, less capacity I know. But thats ok!

 Leave the space to upgrade these nodes with SSDs in the future.
 If your cluster grows large enough (more than 20 nodes) even a single
 P3700 might do the trick and will need only a PCIe slot.

If I get you right, the 12Disk is not a bad idea, if there would be the need of 
SSD Journal I can add the PCIe P3700.

In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.

God or bad idea?

 
 There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.

 Danger, Will Robinson.
 This is essentially a RAID5 and you're plain asking for a double disk 
 failure to happen.

May be I do not understand that. size = 2 I think is more sort of raid1 ... ? 
And why am I asking for for a double disk failure?

To less nodes, OSDs or because of the size = 2.

 
 See this recent thread:
 calculating maximum number of disk and node failure that can be 
 handled by cluster with out data loss
 for some discussion and python script which you will need to modify 
 for
 2 disk replication.
 
 With a RAID5 failure calculator you're at 1 data loss event per 3.5 
 years...
 

Thanks for that thread, but I dont get the point out of it for me.

I see that calculating the reliability is some sort of complex math ...

 The workload I expect is more writes of may be some GB of Office 
 files per day and some TB of larger video Files from a few users per week.

 At the end of this year we calculate to have +- 60 to 80 TB of lager 
 videofiles in that cluster, which are accessed from time to time.

 Any suggestion on the drop of ssd journals?

 You will miss them when the cluster does write, be it from clients or 
 when re-balancing a lost OSD.

I can imagine, that I might miss the SSD Journal, but if I can add the
P3700 later I feel comfy with it for now. Budget and evaluation related.

Thanks for your helpful input and feedback. /Götz

--
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium 
für Wissenschaft, Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread David Burley
If you can accept the failure domain, we find a 12:1 ratio of SATA spinners
to a 400GB P3700 is reasonable. Benchmarks can saturate it, but it is
entirely bored in our real-world workload and only 30-50% utilized during
backfills. I am sure one could go even further than 12:1 if they wanted,
but we haven't tested.
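
A quick spec-sheet-style sanity check of that ratio (the sequential-write
figures below are assumptions, not measurements):

nvme_seq_write_mb_s = 1000     # assumed sustained sequential write of a 400GB-class NVMe
hdd_seq_write_mb_s = 110       # assumed per-spinner sequential write ceiling
spinners = 12

journal_demand = spinners * hdd_seq_write_mb_s   # every OSD write hits the journal first
print("journal demand : %d MB/s" % journal_demand)
print("NVMe load      : %.0f%% of its ceiling at full streaming load"
      % (100.0 * journal_demand / nvme_seq_write_mb_s))
# Only an all-spindles streaming benchmark pushes past the ceiling; real
# workloads rarely do, which is why the device looks bored outside backfills.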

On Thu, Jul 9, 2015 at 4:47 AM, Götz Reinicke - IT Koordinator 
goetz.reini...@filmakademie.de wrote:

 Hi Christian,
 Am 09.07.15 um 09:36 schrieb Christian Balzer:
 
  Hello,
 
  On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
 
  Hi again,
 
  time is passing, so is my budget :-/ and I have to recheck the options
  for a starter cluster. An expansion next year for may be an openstack
  installation or more performance if the demands rise is possible. The
  starter could always be used as test or slow dark archive.
 
  At the beginning I was at 16SATA OSDs with 4 SSDs for journal per node,
  but now I'm looking for 12 SATA OSDs without SSD journal. Less
  performance, less capacity I know. But thats ok!
 
  Leave the space to upgrade these nodes with SSDs in the future.
  If your cluster grows large enough (more than 20 nodes) even a single
  P3700 might do the trick and will need only a PCIe slot.

 If I get you right, the 12Disk is not a bad idea, if there would be the
 need of SSD Journal I can add the PCIe P3700.

 In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.

 God or bad idea?

 
  There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
 
  Danger, Will Robinson.
  This is essentially a RAID5 and you're plain asking for a double disk
  failure to happen.

 May be I do not understand that. size = 2 I think is more sort of raid1
 ... ? And why am I asking for for a double disk failure?

 To less nodes, OSDs or because of the size = 2.

 
  See this recent thread:
  calculating maximum number of disk and node failure that can be handled
  by cluster with out data loss
  for some discussion and python script which you will need to modify for
  2 disk replication.
 
  With a RAID5 failure calculator you're at 1 data loss event per 3.5
  years...
 

 Thanks for that thread, but I dont get the point out of it for me.

 I see that calculating the reliability is some sort of complex math ...

  The workload I expect is more writes of may be some GB of Office files
  per day and some TB of larger video Files from a few users per week.
 
  At the end of this year we calculate to have +- 60 to 80 TB of lager
  videofiles in that cluster, which are accessed from time to time.
 
  Any suggestion on the drop of ssd journals?
 
  You will miss them when the cluster does write, be it from clients or
 when
  re-balancing a lost OSD.

 I can imagine, that I might miss the SSD Journal, but if I can add the
 P3700 later I feel comfy with it for now. Budget and evaluation related.

 Thanks for your helpful input and feedback. /Götz

 --
 Götz Reinicke
 IT-Koordinator

 Tel. +49 7141 969 82420
 E-Mail goetz.reini...@filmakademie.de

 Filmakademie Baden-Württemberg GmbH
 Akademiehof 10
 71638 Ludwigsburg
 www.filmakademie.de

 Eintragung Amtsgericht Stuttgart HRB 205016

 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
 Staatssekretär im Ministerium für Wissenschaft,
 Forschung und Kunst Baden-Württemberg

 Geschäftsführer: Prof. Thomas Schadt



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Christian Balzer

Hello,

On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:

 Hi again,
 
 time is passing, so is my budget :-/ and I have to recheck the options
 for a starter cluster. An expansion next year for may be an openstack
 installation or more performance if the demands rise is possible. The
 starter could always be used as test or slow dark archive.
 
 At the beginning I was at 16SATA OSDs with 4 SSDs for journal per node,
 but now I'm looking for 12 SATA OSDs without SSD journal. Less
 performance, less capacity I know. But thats ok!
 
Leave the space to upgrade these nodes with SSDs in the future.
If your cluster grows large enough (more than 20 nodes) even a single
P3700 might do the trick and will need only a PCIe slot.

 There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
 
Danger, Will Robinson.
This is essentially RAID5, and you're plainly asking for a double disk
failure to happen.

See this recent thread:
"calculating maximum number of disk and node failure that can be handled
by cluster with out data loss"
for some discussion and a Python script which you will need to modify for
2-disk replication.

With a RAID5 failure calculator you're at 1 data loss event per 3.5
years...
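
For anyone who doesn't want to dig up that script, here is a minimal
stand-in for the size=2 case. Every input is an assumption (AFR, recovery
time, peer count), and calculators that also model unrecoverable read errors
during recovery come out considerably worse:

# After one OSD fails, data is lost if any OSD sharing PGs with it also
# fails before re-replication completes.

def loss_events_per_year(disks, afr, recovery_hours, peers=None):
    peers = disks - 1 if peers is None else peers  # OSDs sharing PGs with the dead one
    p_disk_hour = afr / 8760.0                     # per-disk failure probability per hour
    p_second = 1 - (1 - p_disk_hour) ** (peers * recovery_hours)
    return disks * afr * p_second                  # first failures/year x P(overlap)

disks = 8 * 12          # 8 nodes x 12 OSDs
afr = 0.05              # assumed 5% annual failure rate per spinner
recovery_hours = 24     # assumed time to re-replicate one failed OSD
events = loss_events_per_year(disks, afr, recovery_hours)
print("expected size=2 data-loss events/year: %.3f (one per %.0f years)"
      % (events, 1 / events))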

 The workload I expect is more writes of may be some GB of Office files
 per day and some TB of larger video Files from a few users per week.
 
 At the end of this year we calculate to have +- 60 to 80 TB of lager
 videofiles in that cluster, which are accessed from time to time.
 
 Any suggestion on the drop of ssd journals?
 
You will miss them when the cluster does write, be it from clients or when
re-balancing a lost OSD.

Christian
   Thanks as always for your feedback . Götz
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Götz Reinicke - IT Koordinator
Hi again,

time is passing, and so is my budget :-/ so I have to recheck the options
for a starter cluster. An expansion next year, maybe for an OpenStack
installation or for more performance if the demands rise, is possible. The
starter could always be used as a test or slow dark archive.

At the beginning I was at 16 SATA OSDs with 4 SSDs for journals per node,
but now I'm looking at 12 SATA OSDs without SSD journals. Less
performance, less capacity, I know. But that's OK!

There should be 6, maybe 8, nodes with the 12 OSDs each and a repl. of 2.

The workload I expect is mostly writes of maybe some GB of Office files
per day and some TB of larger video files from a few users per week.

At the end of this year we expect to have +/- 60 to 80 TB of larger
video files in that cluster, which are accessed from time to time.
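
For the 60 to 80 TB target, a quick usable-capacity sanity check (disk size
and fill target below are assumptions):

nodes, osds_per_node, disk_tb = 8, 12, 4   # e.g. 4 TB SATA drives
fill_target = 0.70                         # headroom for rebalancing / full ratios

raw_tb = nodes * osds_per_node * disk_tb
for size in (2, 3):
    print("size=%d: ~%.0f TB usable" % (size, raw_tb * fill_target / size))
# size=2: ~134 TB, size=3: ~90 TB -- both above the 60-80 TB target.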

Any suggestion on the drop of ssd journals?

Thanks as always for your feedback . Götz

-- 
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
Staatssekretär im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Götz Reinicke - IT Koordinator
Hi Christian,
Am 09.07.15 um 09:36 schrieb Christian Balzer:
 
 Hello,
 
 On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
 
 Hi again,

 time is passing, so is my budget :-/ and I have to recheck the options
 for a starter cluster. An expansion next year for may be an openstack
 installation or more performance if the demands rise is possible. The
 starter could always be used as test or slow dark archive.

 At the beginning I was at 16SATA OSDs with 4 SSDs for journal per node,
 but now I'm looking for 12 SATA OSDs without SSD journal. Less
 performance, less capacity I know. But thats ok!

 Leave the space to upgrade these nodes with SSDs in the future.
 If your cluster grows large enough (more than 20 nodes) even a single
 P3700 might do the trick and will need only a PCIe slot.

If I get you right, the 12-disk setup is not a bad idea; if there were a
need for SSD journals, I could add the PCIe P3700.

In the 12-OSD setup I should get 2 P3700s, one per 6 OSDs.

Good or bad idea?

 
 There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.

 Danger, Will Robinson.
 This is essentially a RAID5 and you're plain asking for a double disk
 failure to happen.

Maybe I do not understand that. size = 2 is, I think, more a sort of RAID1
...? And why am I asking for a double disk failure?

Too few nodes, too few OSDs, or because of size = 2?

 
 See this recent thread:
 calculating maximum number of disk and node failure that can be handled
 by cluster with out data loss
 for some discussion and python script which you will need to modify for
 2 disk replication.
 
 With a RAID5 failure calculator you're at 1 data loss event per 3.5
 years...
 

Thanks for that thread, but I don't get the point out of it for me.

I see that calculating the reliability is some sort of complex math ...

 The workload I expect is more writes of may be some GB of Office files
 per day and some TB of larger video Files from a few users per week.

 At the end of this year we calculate to have +- 60 to 80 TB of lager
 videofiles in that cluster, which are accessed from time to time.

 Any suggestion on the drop of ssd journals?

 You will miss them when the cluster does write, be it from clients or when
 re-balancing a lost OSD.

I can imagine that I might miss the SSD journal, but if I can add the
P3700 later I feel comfy with it for now. Budget and evaluation related.

Thanks for your helpful input and feedback. /Götz

-- 
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
Staatssekretär im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Quentin Hartman
So, I was running with size=2 until we had a network interface on an
OSD node go faulty and start corrupting data. Because Ceph couldn't tell
which copy was right, it caused all sorts of trouble. I might have been able
to recover more gracefully had I caught the problem sooner and been able to
identify the root cause right away, but as it was, we ended up labeling every
VM in the cluster as suspect, destroying the whole thing, and restoring from
backups. I didn't manage to find the root of the problem until I was
rebuilding the cluster and noticed one node felt weird when I was
ssh'd into it. It was painful.

We are currently running important VMs from a Ceph pool with size=3, and
more disposable ones from a size=2 pool, and that seems to be a reasonable
tradeoff so far, giving us a bit more IO overhead than we would have
running 3 for everything, but still having safety where we need it.

QH

On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke 
goetz.reini...@filmakademie.de wrote:

 Hi Warren,

 thanks for that feedback. regarding the 2 or 3 copies we had a lot of
 internal discussions and lots of pros and cons on 2 and 3 :) … and finally
 decided to give 2 copies in the first - now called evaluation cluster - a
 chance to prove.

 I bet in 2016 we will see, if that was a good decision or bad and data los
 is in that scenario ok. We evaluate. :)

 Regarding one P3700 for 12 SATA disks I do get it right, that if that
 P3700 fails all 12 OSDs are lost… ? So that looks like a bigger risk to me
 from my current knowledge. Or are the P3700 so much more reliable than the
 eg. S3500 or S3700?

 Or is the suggestion with the P3700 if we go in the direction of 20+ nodes
 and till than stay without SSDs for journaling.

 I really appreciate your thoughts and feedback and I’m aware of the fact
 that building a ceph cluster is some sort of knowing the specs,
 configuration option, math, experience, modification and feedback from best
 practices real world clusters. Finally all clusters are unique in some way
 and what works for one will not work for an other.

 Thanks for feedback, 100 kowtows . Götz



  Am 09.07.2015 um 16:58 schrieb Wang, Warren 
 warren_w...@cable.comcast.com:
 
  You'll take a noticeable hit on write latency. Whether or not it's
 tolerable will be up to you and the workload you have to capture. Large
 file operations are throughput efficient without an SSD journal, as long as
 you have enough spindles.
 
  About the Intel P3700, you will only need 1 to keep up with 12 SATA
 drives. The 400 GB is probably okay if you keep the journal sizes small,
 but the 800 is probably safer if you plan on leaving these in production
 for a few years. Depends on the turnover of data on the servers.
 
  The dual disk failure comment is pointing out that you are more exposed
 for data loss with 2 copies. You do need to understand that there is a
 possibility for 2 drives to fail either simultaneously, or one before the
 cluster is repaired. As usual, this is going to be a decision you need to
 decide if it's acceptable or not. We have many clusters, and some are 2,
 and others are 3. If your data resides nowhere else, then 3 copies is the
 safe thing to do. That's getting harder and harder to justify though, when
 the price of other storage solutions using erasure coding continues to
 plummet.
 
  Warren
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Götz Reinicke - IT Koordinator
  Sent: Thursday, July 09, 2015 4:47 AM
  To: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Real world benefit from SSD Journals for a
 more read than write cluster
 
  Hi Christian,
  Am 09.07.15 um 09:36 schrieb Christian Balzer:
 
  Hello,
 
  On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
 
  Hi again,
 
  time is passing, so is my budget :-/ and I have to recheck the
  options for a starter cluster. An expansion next year for may be an
  openstack installation or more performance if the demands rise is
  possible. The starter could always be used as test or slow dark
 archive.
 
  At the beginning I was at 16SATA OSDs with 4 SSDs for journal per
  node, but now I'm looking for 12 SATA OSDs without SSD journal. Less
  performance, less capacity I know. But thats ok!
 
  Leave the space to upgrade these nodes with SSDs in the future.
  If your cluster grows large enough (more than 20 nodes) even a single
  P3700 might do the trick and will need only a PCIe slot.
 
  If I get you right, the 12Disk is not a bad idea, if there would be the
 need of SSD Journal I can add the PCIe P3700.
 
  In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.
 
  God or bad idea?
 
 
  There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
 
  Danger, Will Robinson.
  This is essentially a RAID5 and you're plain asking for a double disk
  failure to happen.
 
  May be I do not understand that. size = 2 I think is more sort of raid1

Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Götz Reinicke
Hi Warren,

thanks for that feedback. Regarding the 2 or 3 copies, we had a lot of internal
discussions and lots of pros and cons on 2 and 3 :) … and finally decided to
give 2 copies a chance to prove themselves in the first - now called the evaluation - cluster.

I bet in 2016 we will see whether that was a good or a bad decision; data loss
is OK in that scenario. We evaluate. :)

Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700
fails, all 12 OSDs are lost…? That looks like a bigger risk to me from my
current knowledge. Or is the P3700 so much more reliable than, e.g., the S3500
or S3700?

Or is the suggestion to add the P3700 only if we go in the direction of 20+
nodes, and until then stay without SSDs for journaling?

I really appreciate your thoughts and feedback, and I'm aware of the fact that
building a Ceph cluster is a mix of knowing the specs, configuration options,
math, experience, modification, and feedback from best-practice real-world
clusters. In the end, all clusters are unique in some way, and what works for
one will not work for another.

Thanks for feedback, 100 kowtows . Götz


 
 Am 09.07.2015 um 16:58 schrieb Wang, Warren warren_w...@cable.comcast.com:
 
 You'll take a noticeable hit on write latency. Whether or not it's tolerable 
 will be up to you and the workload you have to capture. Large file operations 
 are throughput efficient without an SSD journal, as long as you have enough 
 spindles.
 
 About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. 
 The 400 GB is probably okay if you keep the journal sizes small, but the 800 
 is probably safer if you plan on leaving these in production for a few years. 
 Depends on the turnover of data on the servers.
 
 The dual disk failure comment is pointing out that you are more exposed for 
 data loss with 2 copies. You do need to understand that there is a 
 possibility for 2 drives to fail either simultaneously, or one before the 
 cluster is repaired. As usual, this is going to be a decision you need to 
 decide if it's acceptable or not. We have many clusters, and some are 2, and 
 others are 3. If your data resides nowhere else, then 3 copies is the safe 
 thing to do. That's getting harder and harder to justify though, when the 
 price of other storage solutions using erasure coding continues to plummet.
 
 Warren
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz 
 Reinicke - IT Koordinator
 Sent: Thursday, July 09, 2015 4:47 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more 
 read than write cluster
 
 Hi Christian,
 Am 09.07.15 um 09:36 schrieb Christian Balzer:
 
 Hello,
 
 On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
 
 Hi again,
 
 time is passing, so is my budget :-/ and I have to recheck the 
 options for a starter cluster. An expansion next year for may be an 
 openstack installation or more performance if the demands rise is 
 possible. The starter could always be used as test or slow dark archive.
 
 At the beginning I was at 16SATA OSDs with 4 SSDs for journal per 
 node, but now I'm looking for 12 SATA OSDs without SSD journal. Less 
 performance, less capacity I know. But thats ok!
 
 Leave the space to upgrade these nodes with SSDs in the future.
 If your cluster grows large enough (more than 20 nodes) even a single
 P3700 might do the trick and will need only a PCIe slot.
 
 If I get you right, the 12Disk is not a bad idea, if there would be the need 
 of SSD Journal I can add the PCIe P3700.
 
 In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.
 
 God or bad idea?
 
 
 There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
 
 Danger, Will Robinson.
 This is essentially a RAID5 and you're plain asking for a double disk 
 failure to happen.
 
 May be I do not understand that. size = 2 I think is more sort of raid1 ... ? 
 And why am I asking for for a double disk failure?
 
 To less nodes, OSDs or because of the size = 2.
 
 
 See this recent thread:
 calculating maximum number of disk and node failure that can be 
 handled by cluster with out data loss
 for some discussion and python script which you will need to modify 
 for
 2 disk replication.
 
 With a RAID5 failure calculator you're at 1 data loss event per 3.5 
 years...
 
 
 Thanks for that thread, but I dont get the point out of it for me.
 
 I see that calculating the reliability is some sort of complex math ...
 
 The workload I expect is more writes of may be some GB of Office 
 files per day and some TB of larger video Files from a few users per week.
 
 At the end of this year we calculate to have +- 60 to 80 TB of lager 
 videofiles in that cluster, which are accessed from time to time.
 
 Any suggestion on the drop of ssd journals?
 
 You will miss them when the cluster does write, be it from clients or 
 when re-balancing a lost OSD.
 
 I can imagine, that I