On Tue, 21 Feb 2012, Paul Pettigrew wrote:
> G'day Greg, thanks for the fast response.
>
> Yes, I forgot to explicitly state the journal would go to the SATA journals in CASE1, and it is easy to appreciate the performance impact of this case, as you documented nicely in your response.
>
> Re your second point:
> > The other big advantage an SSD provides is in write latency; if you're journaling on an SSD you can send things to disk and get a commit back without having to wait on rotating media. How big an impact that will make will depend on your other config options and use case, though.
>
> Are you able to detail which config options tune this, and an example use case to illustrate?
Actually, I don't think there are many config options to worry about. The easiest way to see this latency is to do something like

    rados mkpool foo
    rados -p foo bench 30 write -b 4096 -t 1

which will do a single small sync IO at a time. You'll notice a big difference depending on whether your journal is a file, a raw partition, an SSD, or NVRAM. When you have many parallel IOs (-t 100), you might also see a difference with a raw partition if you enable aio on the journal (journal aio = true in ceph.conf). Maybe. We haven't tuned that yet.

sage
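For reference, the parallel-IO variant of the same test and the corresponding ceph.conf knob would look roughly like this; the pool name and 30-second run length are just the values from the example above, and whether aio actually helps is, as noted, untuned:

    # create the test pool if it does not already exist
    rados mkpool foo

    # many 4KB writes in flight instead of one; throughput- rather than latency-bound
    rados -p foo bench 30 write -b 4096 -t 100

    # ceph.conf -- per the note above, only relevant when the journal is a raw partition
    [osd]
        journal aio = true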
> Many thanks
>
> Paul
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Gregory Farnum
> Sent: Tuesday, 21 February 2012 10:50 AM
> To: Paul Pettigrew
> Cc: Sage Weil; Wido den Hollander; [email protected]
> Subject: Re: Which SSD method is better for performance?
>
> On Mon, Feb 20, 2012 at 4:44 PM, Paul Pettigrew <[email protected]> wrote:
> > Thanks Sage
> >
> > So following through by two examples, to confirm my understanding...
> >
> > HDD SPECS:
> > 8x 2TB SATA HDDs able to do a sustained read/write speed of 138MB/s each
> > 1x SSD able to do a sustained read/write speed of 475MB/s
> >
> > CASE 1 (not using SSD)
> > 8x OSDs, one for each SATA HDD, therefore able to parallelise IO operations.
> > Sustained write sent to Ceph of a very large file, say 500GB (therefore all caches are used up and the bottleneck becomes SATA IO speed).
> > Gives 8x 138MB/s = 1,104 MB/s
> >
> > CASE 2 (using 1x SSD)
> > SSD partitioned into 8x separate partitions, one for each OSD.
> > Sustained write (with the OSD journal on the SSD) sent to Ceph of a very large file (say 500GB).
> > Write split across the 8x OSD journal partitions on the single SSD = limited to an aggregate of 475MB/s
> >
> > ANALYSIS:
> > If my examples are how Ceph operates, then it is necessary not to exceed a ratio of 3 SATA : 1 SSD; if 4 or more SATA disks are used, the SSD becomes the bottleneck.
> >
> > Is this analysis accurate? Are there other benefits that SSDs provide (including in non-sustained, peak write performance use cases) that would otherwise justify their usage? What ratios are other users sticking to when deciding on their design?
>
> Well, you seem to be leaving out the journals entirely in the first case. You could put them on a separate partition on the SATA disks if you wanted, which (on a modern drive) would net you half the single-stream throughput, or ~552MB/s aggregate.
>
> The other big advantage an SSD provides is in write latency; if you're journaling on an SSD you can send things to disk and get a commit back without having to wait on rotating media. How big an impact that will make will depend on your other config options and use case, though.
> -Greg
>
> > Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page I will be offering to Sage to include in the main Ceph wiki site.
> >
> > Paul
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Sage Weil
> > Sent: Monday, 20 February 2012 1:16 PM
> > To: Paul Pettigrew
> > Cc: Wido den Hollander; [email protected]
> > Subject: RE: Which SSD method is better for performance?
> >
> > On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> >> And secondly, should the SSD journal sizes be large or small? I.e., is say a 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD as possible? There are many forum posts that say 100-200MB will suffice.
> >> A quick piece of advice will save us hopefully several days of reconfiguring and benchmarking the cluster :-)
> >
> > ceph-osd will periodically do a 'commit' to ensure that stuff in the journal is written safely to the file system. On btrfs that's a snapshot, on anything else it's a sync(2). When the journal hits 50% we trigger a commit, or when a timer expires (I think 30 seconds by default). There is some overhead associated with the sync/snapshot, so less is generally better.
> >
> > A decent rule of thumb is probably to make the journal big enough to consume sustained writes for 10-30 seconds. On modern disks, that's probably 1-3GB? If the journal is on the same spindle as the fs, it'll probably be half that...
> > </hand waving>
> >
> > sage
> >
> >> Thanks
> >>
> >> Paul
> >>
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Wido den Hollander
> >> Sent: Tuesday, 14 February 2012 10:46 PM
> >> To: Paul Pettigrew
> >> Cc: [email protected]
> >> Subject: Re: Which SSD method is better for performance?
> >>
> >> Hi,
> >>
> >> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> >> > G'day all
> >> >
> >> > About to commence an R&D eval of the Ceph platform, having been impressed with the momentum achieved over the past 12 months.
> >> >
> >> > I have one question re design before rolling out to metal...
> >> >
> >> > I will be using 1x SSD drive per storage server node (assume it is /dev/sdb for this discussion), and cannot readily determine the pros/cons of the two methods of using it for the OSD journal, being:
> >> > #1. place it in the main [osd] stanza and reference the whole drive as a single partition; or
> >>
> >> That won't work. If you do that, all OSDs will try to open the journal. The journal for each OSD has to be unique.
> >>
> >> > #2. partition up the disk, so 1x partition per SATA HDD, and place each partition in the [osd.N] portion
> >>
> >> That would be your best option.
> >>
> >> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
> >>
> >> The VG "data" is placed on an SSD (Intel X25-M).
> >>
> >> > So if I were to code #1 in the ceph.conf file, it would be:
> >> > [osd]
> >> > osd journal = /dev/sdb
> >> >
> >> > Or, #2 would be like:
> >> > [osd.0]
> >> > host = ceph1
> >> > btrfs devs = /dev/sdc
> >> > osd journal = /dev/sdb5
> >> > [osd.1]
> >> > host = ceph1
> >> > btrfs devs = /dev/sdd
> >> > osd journal = /dev/sdb6
> >> > [osd.2]
> >> > host = ceph1
> >> > btrfs devs = /dev/sde
> >> > osd journal = /dev/sdb7
> >> > [osd.3]
> >> > host = ceph1
> >> > btrfs devs = /dev/sdf
> >> > osd journal = /dev/sdb8
> >> >
> >> > I am asking therefore: is the added work (and constraints) of specifying down to individual partitions per #2 worth it in performance gains? Does it not also have a constraint, in that if I wanted to add more HDDs into the server (we buy 45-bay units, and typically provision HDDs "on demand", i.e. 15x at a time as usage grows), I would have to additionally partition the SSD (taking it offline) - but if it were the #1 option, I would only have to add more [osd.N] sections (and not have to worry about ending up with an SSD carrying 45x partitions)?
> >>
> >> You'd still have to go for #2. However, running 45 OSDs on a single machine is a bit tricky imho.
> >>
> >> If that machine fails you would lose 45 OSDs at once, and that will put a lot of stress on the recovery of your cluster.
> >>
> >> You'd also need a lot of RAM to accommodate those 45 OSDs, at least 48GB of RAM I guess.
> >>
> >> A last note: if you use an SSD for your journaling, make sure that you align your partitions with the page size of the SSD, otherwise you'd run into the write amplification of the SSD, resulting in a performance loss.
> >>
> >> Wido
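To make the sizing and alignment advice above concrete, one possible way to carve the SSD into per-OSD journal partitions is sketched below. The 2GB size follows Sage's 1-3GB rule of thumb, the 1MiB boundaries keep the partitions aligned with the SSD's page size as Wido suggests, and the device name /dev/sdb plus the four-OSD layout are only assumptions for illustration (the resulting partition numbers will differ from the /dev/sdb5-8 used in the ceph.conf example earlier in the thread):

    # wipe /dev/sdb and create four 2GB journal partitions on 1MiB boundaries
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB    2049MiB
    parted -s /dev/sdb mkpart primary 2049MiB 4097MiB
    parted -s /dev/sdb mkpart primary 4097MiB 6145MiB
    parted -s /dev/sdb mkpart primary 6145MiB 8193MiB

    # the partitions come up as /dev/sdb1 .. /dev/sdb4; point the per-OSD
    # "osd journal = ..." lines at them as in the [osd.N] example above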
> >> > One final related question: if I were to use the #1 method (which I would prefer if there is no material performance or other reason to use #2), then that specification (i.e. the "osd journal = /dev/sdb") SSD disk reference would have to be identical on all other hardware nodes, yes (I want to use the same ceph.conf file on all servers per the doco recommendations)? What would happen if, for example, the SSD was on /dev/sde on a new node added into the cluster? References to /dev/disk/by-id etc. are clearly no help, so should a symlink be used from the get-go? E.g. something like "ln -s /dev/sdb /srv/ssd" on one box, and "ln -s /dev/sde /srv/ssd" on the other box, so that in the [osd] section we could use this line, which would find the SSD disk on all nodes: "osd journal = /srv/ssd"?
> >> >
> >> > Many thanks for any advice provided.
> >> >
> >> > Cheers
> >> >
> >> > Paul
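For what it's worth, the symlink idea sketched in that last question would look something like the following on two hosts whose SSDs enumerate differently. The /srv/ssd-journal.N naming, the OSD-to-host assignment, and the use of the $id metavariable (as seen in sample ceph.conf files of this era) are all assumptions for illustration; note that, per Wido's point above, each OSD still needs its own unique journal partition behind its symlink, so method #1 with a single shared whole-device path would not work:

    # on ceph1, where the SSD is /dev/sdb (hosts osd.0 and osd.1)
    ln -s /dev/sdb5 /srv/ssd-journal.0
    ln -s /dev/sdb6 /srv/ssd-journal.1

    # on ceph2, where the same model of SSD came up as /dev/sde (hosts osd.4 and osd.5)
    ln -s /dev/sde5 /srv/ssd-journal.4
    ln -s /dev/sde6 /srv/ssd-journal.5

    # shared ceph.conf: every OSD then finds its journal via the same naming scheme
    [osd]
    osd journal = /srv/ssd-journal.$id

The symlinks live on the root filesystem, so they survive reboots; the remaining risk is the /dev/sdX letters themselves moving around, which could be avoided by pointing the symlinks at the per-host /dev/disk/by-id/...-partN names instead.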
