On Tue, 21 Feb 2012, Paul Pettigrew wrote:
> Thanks Sage
> 
> So, following through with two examples to confirm my understanding:
> 
> HDD SPECS:
> 8x 2TB SATA HDDs, each able to sustain read/write speeds of 138 MB/s
> 1x SSD able to sustain read/write speeds of 475 MB/s
> 
> CASE1
> (not using SSD)
> 8x OSDs, one for each SATA HDD
> Therefore able to parallelise IO operations
> A sustained write of a very large file (say 500GB) is sent to Ceph, so all 
> caches are exhausted and the bottleneck becomes SATA IO speed
> Gives 8x 138 MB/s = 1,104 MB/s
> 
> CASE 2
> (using 1x SSD)
> SSD partitioned into 8x separate partitions, 1x for each OSD
> A sustained write (with the OSD journal on the SSD) of a very large file 
> (say 500GB) is sent to Ceph
> The write is split across 8x OSD journal partitions on the single SSD, so it 
> is limited to the SSD's aggregate 475 MB/s
> 
> ANALYSIS:
> If my examples reflect how Ceph operates, then it is necessary not to exceed 
> a ratio of 3 SATA : 1 SSD; if 4 or more SATA drives are used, the SSD 
> becomes the bottleneck.
> 
> Is this analysis accurate? Are there other benefits that SSDs provide 
> (including in the non-sustained, peak-write-performance use case) that would 
> otherwise justify their use? What ratios are other users sticking to in 
> their designs?

Modulo the missing journals in case 1, I think so.  For most people, 
though, it is pretty rare to try to saturate every disk... there is 
usually some small write and/or read activity going on, and maxing out the 
SSD isn't a problem.  It sounds like you have bonded 10gige interfaces to 
drive this?
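
(To spell out the arithmetic behind the 3:1 figure above: 475 / 138 is about 
3.4, so three 138 MB/s data disks (414 MB/s) stay under the SSD's 475 MB/s, 
while four (552 MB/s) would already exceed it.)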

It may be possible for ceph-osd to skip the journal when it isn't able to 
keep up with the file system.  That will give you crummy latency (since 
writes won't commit until the fs does a sync/commit), but the latency is 
already bad if the journal is behind.  We already do something similar if 
the journal fills up.  (This would only work with btrfs; for other file 
systems we also need the journal to preserve transaction atomicity.)

sage


> 
> Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page 
> I will be offering to Sage to include in the main Ceph wiki site.
> 
> Paul
> 
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Sage Weil
> Sent: Monday, 20 February 2012 1:16 PM
> To: Paul Pettigrew
> Cc: Wido den Hollander; [email protected]
> Subject: RE: Which SSD method is better for performance?
> 
> On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> > And secondly, should the SSD journal sizes be large or small?  I.e., is, 
> > say, a 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD 
> > partition as possible? There are many forum posts that say 100-200MB will 
> > suffice. A quick piece of advice will hopefully save us several days of 
> > reconfiguring and benchmarking the cluster :-)
> 
> ceph-osd will periodically do a 'commit' to ensure that stuff in the journal 
> is written safely to the file system.  On btrfs that's a snapshot, on 
> anything else it's a sync(2).  We trigger a commit when the journal hits 50% 
> full, or when a timer expires (I think 30 seconds by default).  There is 
> some overhead associated with the sync/snapshot, so committing less often is 
> generally better.
> 
> A decent rule of thumb is probably to make the journal big enough to consume 
> sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB?  
> If the journal is on the same spindle as the fs, it'll probably be half 
> that...
> </hand waving>
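> 
> (A rough illustration of that rule of thumb, not an official recommendation: 
> at ~138 MB/s per data disk, 15 seconds of sustained writes is about 2 GB, 
> which for a file-based journal would look something like this in ceph.conf:
> 
>     [osd]
>             osd journal size = 2048    ; in MB, i.e. ~2 GB per OSD journal
> 
> If the journal is a raw partition, the partition size itself sets the 
> limit.)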
> 
> sage
> 
> 
> 
> > 
> > Thanks
> > 
> > Paul
> > 
> > 
> > -----Original Message-----
> > From: [email protected] 
> > [mailto:[email protected]] On Behalf Of Wido den 
> > Hollander
> > Sent: Tuesday, 14 February 2012 10:46 PM
> > To: Paul Pettigrew
> > Cc: [email protected]
> > Subject: Re: Which SSD method is better for performance?
> > 
> > Hi,
> > 
> > On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> > > G'day all
> > >
> > > About to commence an R&D evaluation of the Ceph platform, having been 
> > > impressed with the momentum achieved over the past 12 months.
> > >
> > > I have one question re design before rolling out to metal:
> > >
> > > I will be using 1x SSD drive per storage server node (assume it is 
> > > /dev/sdb for this discussion), and cannot readily determine the pros/cons 
> > > of the two methods of using it for the OSD journal, being:
> > > #1. place it in the main [osd] stanza and reference the whole drive 
> > > as a single partition; or
> > 
> > That won't work. If you do that, all OSDs will try to open the same 
> > journal. The journal for each OSD has to be unique.
> > 
> > > #2. partition up the disk, so 1x partition per SATA HDD, and place 
> > > each partition in the [osd.N] portion
> > 
> > That would be your best option.
> > 
> > I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
> > 
> > the VG "data" is placed on a SSD (Intel X25-M).
> > 
> > >
> > > So if I were to code #1 in the ceph.conf file, it would be:
> > > [osd]
> > > osd journal = /dev/sdb
> > >
> > > Or, #2 would be like:
> > > [osd.0]
> > >          host = ceph1
> > >          btrfs devs = /dev/sdc
> > >          osd journal = /dev/sdb5
> > > [osd.1]
> > >          host = ceph1
> > >          btrfs devs = /dev/sdd
> > >          osd journal = /dev/sdb6
> > > [osd.2]
> > >          host = ceph1
> > >          btrfs devs = /dev/sde
> > >          osd journal = /dev/sdb7
> > > [osd.3]
> > >          host = ceph1
> > >          btrfs devs = /dev/sdf
> > >          osd journal = /dev/sdb8
> > >
> > > I am asking, therefore: is the added work (and the constraints) of 
> > > specifying individual partitions per #2 worth it in performance gains? 
> > > Doesn't it also have a constraint in that, if I wanted to add more HDDs 
> > > to the server (we buy 45-bay units and typically provision HDDs "on 
> > > demand", i.e. 15x at a time as usage grows), I would have to repartition 
> > > the SSD (taking it offline)? Whereas with option #1 I would only have to 
> > > add more [osd.N] sections (and not have to worry about ending up with an 
> > > SSD carrying 45x partitions)?
> > >
> > 
> > You'd still have to go for #2. However, running 45 OSDs on a single 
> > machine is a bit tricky, IMHO.
> > 
> > If that machine fails you would lose 45 OSDs at once, which will put a lot 
> > of stress on the recovery of your cluster.
> > 
> > You'd also need a lot of RAM to accommodate those 45 OSDs; at least 48GB 
> > of RAM, I guess.
> > 
> > One last note: if you use an SSD for your journaling, make sure that you 
> > align your partitions with the page size of the SSD, otherwise you'd run 
> > into write amplification on the SSD, resulting in a performance loss.
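> > 
> > (A sketch of one way to do that with parted, assuming the SSD is /dev/sdb 
> > and four 2 GiB journal partitions are wanted; starting every partition on 
> > a 1 MiB boundary keeps it aligned for common SSD page/erase-block sizes:
> > 
> >     parted -s /dev/sdb mklabel gpt        # wipes any existing partition table
> >     parted -s /dev/sdb mkpart journal0 1MiB 2049MiB
> >     parted -s /dev/sdb mkpart journal1 2049MiB 4097MiB
> >     parted -s /dev/sdb mkpart journal2 4097MiB 6145MiB
> >     parted -s /dev/sdb mkpart journal3 6145MiB 8193MiB
> > 
> > Newer partitioning tools tend to align to 1 MiB by default, but it doesn't 
> > hurt to be explicit.)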
> > 
> > Wido
> > 
> > > One final related question: if I were to use method #1 (which I would 
> > > prefer if there is no material performance or other reason to use #2), 
> > > then that SSD disk reference (i.e. the "osd journal = /dev/sdb") would 
> > > have to be identical on all other hardware nodes, yes? (I want to use 
> > > the same ceph.conf file on all servers per the documentation 
> > > recommendations.) What would happen if, for example, the SSD was on 
> > > /dev/sde on a new node added to the cluster? References to 
> > > /dev/disk/by-id etc. are clearly no help, so should a symlink be used 
> > > from the get-go? E.g. something like "ln -s /dev/sdb /srv/ssd" on one 
> > > box, and "ln -s /dev/sde /srv/ssd" on the other box, so that in the 
> > > [osd] section we could use this line, which would find the SSD disk on 
> > > all nodes: "osd journal = /srv/ssd"?
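> > >
> > > (For illustration only, combining that symlink idea with the 
> > > one-journal-per-OSD requirement above; all paths here are made up:
> > >
> > >     # on ceph1, where the SSD is /dev/sdb
> > >     ln -s /dev/sdb5 /srv/ssd-journal-0
> > >     ln -s /dev/sdb6 /srv/ssd-journal-1
> > >     # on another node, where the same SSD came up as /dev/sde
> > >     ln -s /dev/sde5 /srv/ssd-journal-0
> > >     ln -s /dev/sde6 /srv/ssd-journal-1
> > >
> > > Each [osd.N] section could then say "osd journal = /srv/ssd-journal-N" 
> > > and the same ceph.conf would work on every node, as long as the links 
> > > are set up correctly on each host, e.g. by a udev rule keyed on the 
> > > drive's serial number.)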
> > >
> > > Many thanks for any advice provided.
> > >
> > > Cheers
> > >
> > > Paul