On Tue, 21 Feb 2012, Paul Pettigrew wrote:
> G'day Greg, thanks for the fast response.
>
> Yes, I forgot to explicitly state the journal would go to the SATA journals in CASE1, and it is easy to appreciate the performance impact of this case, as you documented nicely in your response.
>
> Re your second point:
> > The other big advantage an SSD provides is in write latency; if you're journaling on an SSD you can send things to disk and get a commit back without having to wait on rotating media. How big an impact that will make will depend on your other config options and use case, though.
>
> Are you able to detail which config options tune this, and an example use case to illustrate?
Actually, I don't think there are many config options to worry about. The easiest way to see this latency is to do something like

    rados mkpool foo
    rados -p foo bench 30 write -b 4096 -t 1

which will do a single small sync IO at a time. You'll notice a big difference depending on whether your journal is a file, a raw partition, an SSD, or NVRAM. When you have many parallel IOs (-t 100), you might also see a difference with a raw partition if you enable aio on the journal (journal aio = true in ceph.conf). Maybe. We haven't tuned that yet.

sage
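For reference, the parallel-IO variant of the same test and the corresponding ceph.conf knob would look roughly like this; the pool name and 30-second run length are just the values from the example above, and whether aio actually helps is, as noted, untuned:

    # create the test pool if it does not already exist
    rados mkpool foo

    # many 4KB writes in flight instead of one; throughput- rather than latency-bound
    rados -p foo bench 30 write -b 4096 -t 100

    # ceph.conf -- per the note above, only relevant when the journal is a raw partition
    [osd]
        journal aio = true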
> Many thanks
>
> Paul
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Gregory Farnum
> Sent: Tuesday, 21 February 2012 10:50 AM
> To: Paul Pettigrew
> Cc: Sage Weil; Wido den Hollander; [email protected]
> Subject: Re: Which SSD method is better for performance?
>
> On Mon, Feb 20, 2012 at 4:44 PM, Paul Pettigrew <[email protected]> wrote:
> > Thanks Sage
> >
> > So following through by two examples, to confirm my understanding...
> >
> > HDD SPECS:
> > 8x 2TB SATA HDDs able to do a sustained read/write speed of 138MB/s each
> > 1x SSD able to do a sustained read/write speed of 475MB/s
> >
> > CASE 1 (not using SSD)
> > 8x OSDs, one for each SATA HDD, therefore able to parallelise IO operations.
> > Sustained write sent to Ceph of a very large file, say 500GB (therefore all caches are used up and the bottleneck becomes SATA IO speed).
> > Gives 8x 138MB/s = 1,104 MB/s
> >
> > CASE 2 (using 1x SSD)
> > SSD partitioned into 8x separate partitions, one for each OSD.
> > Sustained write (with the OSD journal on the SSD) sent to Ceph of a very large file (say 500GB).
> > Write split across the 8x OSD journal partitions on the single SSD = limited to an aggregate of 475MB/s
> >
> > ANALYSIS:
> > If my examples are how Ceph operates, then it is necessary not to exceed a ratio of 3 SATA : 1 SSD; if 4 or more SATA disks are used, the SSD becomes the bottleneck.
> >
> > Is this analysis accurate? Are there other benefits that SSDs provide (including in non-sustained, peak write performance use cases) that would otherwise justify their usage? What ratios are other users sticking to when deciding on their design?
>
> Well, you seem to be leaving out the journals entirely in the first case. You could put them on a separate partition on the SATA disks if you wanted, which (on a modern drive) would net you half the single-stream throughput, or ~552MB/s aggregate.
>
> The other big advantage an SSD provides is in write latency; if you're journaling on an SSD you can send things to disk and get a commit back without having to wait on rotating media. How big an impact that will make will depend on your other config options and use case, though.
> -Greg
>
> > Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page I will be offering to Sage to include in the main Ceph wiki site.
> >
> > Paul
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Sage Weil
> > Sent: Monday, 20 February 2012 1:16 PM
> > To: Paul Pettigrew
> > Cc: Wido den Hollander; [email protected]
> > Subject: RE: Which SSD method is better for performance?
> >
> > On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> >> And secondly, should the SSD journal sizes be large or small? I.e., is say a 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD as possible? There are many forum posts that say 100-200MB will suffice.
> >> A quick piece of advice will save us hopefully several days of reconfiguring and benchmarking the cluster :-)
> >
> > ceph-osd will periodically do a 'commit' to ensure that stuff in the journal is written safely to the file system. On btrfs that's a snapshot, on anything else it's a sync(2). When the journal hits 50% we trigger a commit, or when a timer expires (I think 30 seconds by default). There is some overhead associated with the sync/snapshot, so less is generally better.
> >
> > A decent rule of thumb is probably to make the journal big enough to consume sustained writes for 10-30 seconds. On modern disks, that's probably 1-3GB? If the journal is on the same spindle as the fs, it'll probably be half that...
> > </hand waving>
> >
> > sage
> >
> >> Thanks
> >>
> >> Paul
> >>
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Wido den Hollander
> >> Sent: Tuesday, 14 February 2012 10:46 PM
> >> To: Paul Pettigrew
> >> Cc: [email protected]
> >> Subject: Re: Which SSD method is better for performance?
> >>
> >> Hi,
> >>
> >> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> >> > G'day all
> >> >
> >> > About to commence an R&D eval of the Ceph platform, having been impressed with the momentum achieved over the past 12 months.
> >> >
> >> > I have one question re design before rolling out to metal...
> >> >
> >> > I will be using 1x SSD drive per storage server node (assume it is /dev/sdb for this discussion), and cannot readily determine the pros/cons of the two methods of using it for the OSD journal, being:
> >> > #1. place it in the main [osd] stanza and reference the whole drive as a single partition; or
> >>
> >> That won't work. If you do that, all OSDs will try to open the journal. The journal for each OSD has to be unique.
> >>
> >> > #2. partition up the disk, so 1x partition per SATA HDD, and place each partition in the [osd.N] portion
> >>
> >> That would be your best option.
> >>
> >> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
> >>
> >> The VG "data" is placed on an SSD (Intel X25-M).
> >>
> >> > So if I were to code #1 in the ceph.conf file, it would be:
> >> > [osd]
> >> > osd journal = /dev/sdb
> >> >
> >> > Or, #2 would be like:
> >> > [osd.0]
> >> > host = ceph1
> >> > btrfs devs = /dev/sdc
> >> > osd journal = /dev/sdb5
> >> > [osd.1]
> >> > host = ceph1
> >> > btrfs devs = /dev/sdd
> >> > osd journal = /dev/sdb6
> >> > [osd.2]
> >> > host = ceph1
> >> > btrfs devs = /dev/sde
> >> > osd journal = /dev/sdb7
> >> > [osd.3]
> >> > host = ceph1
> >> > btrfs devs = /dev/sdf
> >> > osd journal = /dev/sdb8
> >> >
> >> > I am asking therefore: is the added work (and constraints) of specifying down to individual partitions per #2 worth it in performance gains? Does it not also have a constraint, in that if I wanted to add more HDDs into the server (we buy 45-bay units, and typically provision HDDs "on demand", i.e. 15x at a time as usage grows), I would have to additionally partition the SSD (taking it offline) - but if it were the #1 option, I would only have to add more [osd.N] sections (and not have to worry about ending up with an SSD carrying 45x partitions)?
> >>
> >> You'd still have to go for #2. However, running 45 OSDs on a single machine is a bit tricky imho.
> >>
> >> If that machine fails you would lose 45 OSDs at once, and that will put a lot of stress on the recovery of your cluster.
> >>
> >> You'd also need a lot of RAM to accommodate those 45 OSDs, at least 48GB of RAM I guess.
> >>
> >> A last note: if you use an SSD for your journaling, make sure that you align your partitions with the page size of the SSD, otherwise you'd run into the write amplification of the SSD, resulting in a performance loss.
> >>
> >> Wido
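To make the sizing and alignment advice above concrete, one possible way to carve the SSD into per-OSD journal partitions is sketched below. The 2GB size follows Sage's 1-3GB rule of thumb, the 1MiB boundaries keep the partitions aligned with the SSD's page size as Wido suggests, and the device name /dev/sdb plus the four-OSD layout are only assumptions for illustration (the resulting partition numbers will differ from the /dev/sdb5-8 used in the ceph.conf example earlier in the thread):

    # wipe /dev/sdb and create four 2GB journal partitions on 1MiB boundaries
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB    2049MiB
    parted -s /dev/sdb mkpart primary 2049MiB 4097MiB
    parted -s /dev/sdb mkpart primary 4097MiB 6145MiB
    parted -s /dev/sdb mkpart primary 6145MiB 8193MiB

    # the partitions come up as /dev/sdb1 .. /dev/sdb4; point the per-OSD
    # "osd journal = ..." lines at them as in the [osd.N] example above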
> >> > One final related question: if I were to use the #1 method (which I would prefer if there is no material performance or other reason to use #2), then that specification (i.e. the "osd journal = /dev/sdb") SSD disk reference would have to be identical on all other hardware nodes, yes (I want to use the same ceph.conf file on all servers per the doco recommendations)? What would happen if, for example, the SSD was on /dev/sde on a new node added into the cluster? References to /dev/disk/by-id etc. are clearly no help, so should a symlink be used from the get-go? E.g. something like "ln -s /dev/sdb /srv/ssd" on one box, and "ln -s /dev/sde /srv/ssd" on the other box, so that in the [osd] section we could use this line, which would find the SSD disk on all nodes: "osd journal = /srv/ssd"?
> >> >
> >> > Many thanks for any advice provided.
> >> >
> >> > Cheers
> >> >
> >> > Paul
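For what it's worth, the symlink idea sketched in that last question would look something like the following on two hosts whose SSDs enumerate differently. The /srv/ssd-journal.N naming, the OSD-to-host assignment, and the use of the $id metavariable (as seen in sample ceph.conf files of this era) are all assumptions for illustration; note that, per Wido's point above, each OSD still needs its own unique journal partition behind its symlink, so method #1 with a single shared whole-device path would not work:

    # on ceph1, where the SSD is /dev/sdb (hosts osd.0 and osd.1)
    ln -s /dev/sdb5 /srv/ssd-journal.0
    ln -s /dev/sdb6 /srv/ssd-journal.1

    # on ceph2, where the same model of SSD came up as /dev/sde (hosts osd.4 and osd.5)
    ln -s /dev/sde5 /srv/ssd-journal.4
    ln -s /dev/sde6 /srv/ssd-journal.5

    # shared ceph.conf: every OSD then finds its journal via the same naming scheme
    [osd]
    osd journal = /srv/ssd-journal.$id

The symlinks live on the root filesystem, so they survive reboots; the remaining risk is the /dev/sdX letters themselves moving around, which could be avoided by pointing the symlinks at the per-host /dev/disk/by-id/...-partN names instead.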
