On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> And secondly, should the SSD journal sizes be large or small? I.e., is,
> say, a 1GB partition per paired 2-3TB SATA disk OK? Or as large an SSD
> as possible? There are many forum posts that say 100-200MB will suffice.
> A quick piece of advice will hopefully save us several days of
> reconfiguring and benchmarking the cluster :-)

ceph-osd will periodically do a 'commit' to ensure that what is in the journal
has been written safely to the file system. On btrfs that's a snapshot; on
anything else it's a sync(2). When the journal hits 50% full we trigger a
commit, or when a timer expires (I think 30 seconds by default). There is some
overhead associated with the sync/snapshot, so committing less often is
generally better.

A decent rule of thumb is probably to make the journal big enough to absorb
sustained writes for 10-30 seconds. On modern disks, that's probably 1-3GB?
If the journal is on the same spindle as the fs, it'll probably be half
that... </hand waving>

sage
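For illustration, a minimal ceph.conf sketch of that rule of thumb, assuming a
data disk that sustains roughly 100 MB/s (device names are hypothetical; "osd
journal size" is given in MB and only matters for a file-based journal, since a
raw partition is simply used at whatever size it already is):

    [osd]
        # ~100 MB/s sustained writes x ~20 s between commits = about 2 GB.
        # Only consulted when the journal is a plain file.
        osd journal size = 2048

    [osd.0]
        host = ceph1
        btrfs devs = /dev/sdc
        # ~2 GB journal partition on the shared SSD (hypothetical device)
        osd journal = /dev/sdb5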
> Thanks
> Paul
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Wido den Hollander
> Sent: Tuesday, 14 February 2012 10:46 PM
> To: Paul Pettigrew
> Cc: [email protected]
> Subject: Re: Which SSD method is better for performance?
>
> Hi,
>
> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> > G'day all
> >
> > About to commence an R&D eval of the Ceph platform, having been impressed
> > with the momentum achieved over the past 12 months.
> >
> > I have one question re design before rolling out to metal...
> >
> > I will be using 1x SSD drive per storage server node (assume it is /dev/sdb
> > for this discussion), and cannot readily determine the pros and cons of the
> > two methods of using it for the OSD journal, being:
> > #1. place it in the main [osd] stanza and reference the whole drive as
> >     a single partition; or
>
> That won't work. If you do that, all OSDs will try to open the same journal.
> The journal for each OSD has to be unique.
>
> > #2. partition up the disk, so 1x partition per SATA HDD, and place
> >     each partition in the [osd.N] portion
>
> That would be your best option.
>
> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
>
> The VG "data" is placed on an SSD (Intel X25-M).
>
> > So if I were to code #1 in the ceph.conf file, it would be:
> >
> > [osd]
> >     osd journal = /dev/sdb
> >
> > Or, #2 would be like:
> >
> > [osd.0]
> >     host = ceph1
> >     btrfs devs = /dev/sdc
> >     osd journal = /dev/sdb5
> > [osd.1]
> >     host = ceph1
> >     btrfs devs = /dev/sdd
> >     osd journal = /dev/sdb6
> > [osd.2]
> >     host = ceph1
> >     btrfs devs = /dev/sde
> >     osd journal = /dev/sdb7
> > [osd.3]
> >     host = ceph1
> >     btrfs devs = /dev/sdf
> >     osd journal = /dev/sdb8
> >
> > I am asking therefore: is the added work (and the constraints) of specifying
> > individual partitions per #2 worth it in performance gains? Does it not also
> > have a constraint, in that if I wanted to add more HDDs to the server (we buy
> > 45-bay units, and typically provision HDDs "on demand", i.e. 15x at a time as
> > usage grows), I would have to additionally partition the SSD (taking it
> > offline), whereas with #1 I would only have to add more [osd.N] sections (and
> > not have to worry about ending up with an SSD carrying 45x partitions)?
>
> You'd still have to go for #2. However, running 45 OSDs on a single machine
> is a bit tricky imho.
>
> If that machine fails you would lose 45 OSDs at once, and that will put a lot
> of stress on the recovery of your cluster.
>
> You'd also need a lot of RAM to accommodate those 45 OSDs; at least 48GB of
> RAM, I guess.
>
> A last note: if you use an SSD for your journaling, make sure that you align
> your partitions with the page size of the SSD, otherwise you'd run into the
> write amplification of the SSD, resulting in a performance loss.
>
> Wido
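For illustration, one way to carve the journal partitions so they honour the
alignment note above, sketched with parted (device names and sizes are
hypothetical). A 1 MiB starting offset is a multiple of any common SSD page
size, which sidesteps the write-amplification penalty Wido describes:

    # Label the SSD and create two 2 GiB journal partitions, each starting
    # on a 1 MiB boundary.
    parted -s -a optimal /dev/sdb mklabel gpt
    parted -s -a optimal /dev/sdb mkpart primary 1MiB 2049MiB
    parted -s -a optimal /dev/sdb mkpart primary 2049MiB 4097MiB

    # Confirm that each new partition really is aligned.
    parted /dev/sdb align-check optimal 1
    parted /dev/sdb align-check optimal 2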
>
> > One final related question: if I were to use method #1 (which I would
> > prefer if there is no material performance or other reason to use #2), then
> > that specification (i.e. the "osd journal = /dev/sdb" SSD disk reference)
> > would have to be identical on all the other hardware nodes, yes (I want to
> > use the same ceph.conf file on all servers per the doco recommendations)?
> > What would happen if, for example, the SSD was on /dev/sde on a new node
> > added into the cluster? References to /dev/disk/by-id etc. are clearly no
> > help, so should a symlink be used from the get-go? E.g. something like
> > "ln -s /dev/sdb /srv/ssd" on one box, and "ln -s /dev/sde /srv/ssd" on the
> > other box, so that in the [osd] section we could use this line, which would
> > find the SSD disk on all nodes: "osd journal = /srv/ssd"?
> >
> > Many thanks for any advice provided.
> >
> > Cheers
> >
> > Paul
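And for the symlink idea in that last question, a sketch adapted to the
per-partition layout Wido recommends (host names, OSD numbers and device names
are hypothetical; per Wido's earlier point each OSD still needs its own unique
journal, hence one stable path per journal partition rather than a single
/srv/ssd):

    # On ceph1, where the SSD enumerates as /dev/sdb (osd.0 and osd.1 live here):
    ln -s /dev/sdb5 /srv/journal-osd.0
    ln -s /dev/sdb6 /srv/journal-osd.1

    # On ceph2, where the same model of SSD happens to come up as /dev/sde
    # (osd.4 and osd.5 live here):
    ln -s /dev/sde5 /srv/journal-osd.4
    ln -s /dev/sde6 /srv/journal-osd.5

    # The shared ceph.conf then never references a raw /dev/sdX name:
    #   [osd.0]
    #       host = ceph1
    #       osd journal = /srv/journal-osd.0
    # The links do have to be re-pointed by hand if a node ever re-enumerates
    # its disks.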
