On Mon, Feb 20, 2012 at 4:44 PM, Paul Pettigrew
<[email protected]> wrote:
> Thanks Sage
>
> So, following through with two examples to confirm my understanding:
>
> HDD SPECS:
> 8x 2TB SATA HDDs, each able to sustain a read/write speed of 138MB/s
> 1x SSD able to sustain a read/write speed of 475MB/s
>
> CASE1
> (not using SSD)
> 8x OSDs, one per SATA HDD
> Therefore able to parallelise I/O operations
> A sustained write of a very large file (say 500GB) is sent to Ceph (so all
> caches are used up and the bottleneck becomes SATA I/O speed)
> Gives 8x 138MB/s = 1,104 MB/s
>
> CASE 2
> (using 1x SSD)
> The SSD is partitioned into 8x separate partitions, one for each OSD
> A sustained write of a very large file (say 500GB) is sent to Ceph, with the
> OSD journal on the SSD
> The write is split across the 8x OSD journal partitions on the single SSD, so
> it is limited to an aggregate of 475MB/s
>
> ANALYSIS:
> If my examples are how Ceph operates, then it is necessary not to exceed a
> ratio of 3 SATA : 1 SSD; if 4 or more SATA drives are used, the SSD becomes
> the bottleneck.
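>
> (Worked numbers behind that ratio, using the specs above: 3x 138MB/s = 414MB/s,
> which fits under the SSD's 475MB/s, while 4x 138MB/s = 552MB/s, which exceeds
> it.)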
>
> Is this analysis accurate? Are there other benefits that SSDs provide
> (including in the non-sustained, peak-write-performance use case) that would
> otherwise justify their usage? What ratios are other users sticking to in
> their designs?

Well, you seem to be leaving out the journals entirely in the first
case. You could put them on a separate partition on each SATA disk if
you wanted, which (on a modern drive) would net you half the
single-stream throughput per disk, or ~552MB/s aggregate (8 x 138MB/s / 2).
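
For reference, a minimal sketch of that co-located layout in ceph.conf
(device names are illustrative; each journal sits on a second partition of
its own data disk):

[osd.0]
        host = ceph1
        btrfs devs = /dev/sdc1
        osd journal = /dev/sdc2
[osd.1]
        host = ceph1
        btrfs devs = /dev/sdd1
        osd journal = /dev/sdd2

and so on for the remaining disks.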

The other big advantage an SSD provides is in write latency; if you're
journaling on an SSD you can send things to disk and get a commit back
without having to wait on rotating media. How big an impact that makes
will depend on your other config options and use case, though.
-Greg

>
> Many thanks, all - this is all being rolled up into a new "Ceph SSD" wiki page
> that I will offer to Sage for inclusion in the main Ceph wiki site.
>
> Paul
>
>
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Sage Weil
> Sent: Monday, 20 February 2012 1:16 PM
> To: Paul Pettigrew
> Cc: Wido den Hollander; [email protected]
> Subject: RE: Which SSD method is better for performance?
>
> On Mon, 20 Feb 2012, Paul Pettigrew wrote:
>> And secondly, should the SSD journal sizes be large or small? I.e., is
>> say a 1GB partition per paired 2-3TB SATA disk OK, or should it be as
>> large as possible? There are many forum posts that say 100-200MB will
>> suffice. A quick piece of advice will hopefully save us several days of
>> reconfiguring and benchmarking the cluster :-)
>
> ceph-osd will periodically do a 'commit' to ensure that stuff in the journal
> is written safely to the file system.  On btrfs that's a snapshot, on
> anything else it's a sync(2).  When the journal hits 50% full we trigger a
> commit, or when a timer expires (I think 30 seconds by default).  There is
> some overhead associated with the sync/snapshot, so committing less often is
> generally better.
>
> A decent rule of thumb is probably to make the journal big enough to consume 
> sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB?  
> If the journal is on the same spindle as the fs, it'll probably be half
> that...
> </hand waving>
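>
> As a minimal sketch of that rule of thumb (assuming the "osd journal size"
> option, which takes a value in MB): a disk sustaining ~100-150MB/s for
> 10-30 seconds works out to roughly
>
> [osd]
>         osd journal size = 2048    ; ~2GB journal per OSD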
>
> sage
>
>
>
>>
>> Thanks
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Wido den
>> Hollander
>> Sent: Tuesday, 14 February 2012 10:46 PM
>> To: Paul Pettigrew
>> Cc: [email protected]
>> Subject: Re: Which SSD method is better for performance?
>>
>> Hi,
>>
>> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
>> > G'day all
>> >
>> > About to commence an R&D eval of the Ceph platform, having been impressed
>> > with the momentum achieved over the past 12 months.
>> >
>> > I have one question re design before rolling out to metal:
>> >
>> > I will be using 1x SSD drive per storage server node (assume it is
>> > /dev/sdb for this discussion), and cannot readily determine the pros/cons
>> > of the two methods of using it for the OSD journal, being:
>> > #1. place it in the main [osd] stanza and reference the whole drive
>> > as a single partition; or
>>
>> That won't work. If you do that, all OSDs will try to open the same journal.
>> The journal for each OSD has to be unique.
>>
>> > #2. partition up the disk, so 1x partition per SATA HDD, and place
>> > each partition in the [osd.N] portion
>>
>> That would be your best option.
>>
>> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
>>
>> The VG "data" is placed on an SSD (Intel X25-M).
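>>
>> As a rough sketch of that layout (volume and device names are illustrative),
>> each OSD gets its own logical volume on the SSD-backed VG:
>>
>> pvcreate /dev/sdb
>> vgcreate data /dev/sdb
>> lvcreate -L 2G -n journal-osd0 data
>> lvcreate -L 2G -n journal-osd1 data
>>
>> and then per OSD in ceph.conf: osd journal = /dev/data/journal-osd0, etc.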
>>
>> >
>> > So if I were to code #1 in the ceph.conf file, it would be:
>> > [osd]
>> > osd journal = /dev/sdb
>> >
>> > Or, #2 would be like:
>> > [osd.0]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdc
>> >          osd journal = /dev/sdb5
>> > [osd.1]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdd
>> >          osd journal = /dev/sdb6
>> > [osd.2]
>> >          host = ceph1
>> >          btrfs devs = /dev/sde
>> >          osd journal = /dev/sdb7
>> > [osd.3]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdf
>> >          osd journal = /dev/sdb8
>> >
>> > I am asking, therefore: is the added work (and the constraints) of
>> > specifying individual partitions per #2 worth it in performance gains?
>> > Does it not also have a constraint, in that if I wanted to add more HDDs
>> > to the server (we buy 45-bay units and typically provision HDDs "on
>> > demand", i.e. 15x at a time as usage grows), I would have to repartition
>> > the SSD (taking it offline) - whereas with option #1 I would only have to
>> > add more [osd.N] sections (and not have to worry about ending up with 45x
>> > partitions on the SSD)?
>> >
>>
>> You'd still have to go for #2. However, running 45 OSDs on a single machine
>> is a bit tricky imho.
>>
>> If that machine fails you would lose 45 OSDs at once, which will put a lot
>> of stress on the recovery of your cluster.
>>
>> You'd also need a lot of RAM to accommodate those 45 OSDs - at least 48GB,
>> I guess.
>>
>> A last note: if you use an SSD for your journaling, make sure that you align
>> your partitions with the page size of the SSD; otherwise you'll run into
>> write amplification on the SSD, resulting in a performance loss.
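>>
>> A rough sketch of creating aligned journal partitions with parted (device
>> name and sizes are illustrative; starting each partition on a 1MiB boundary
>> is a multiple of common SSD page sizes):
>>
>> parted -s /dev/sdb mklabel gpt
>> parted -s -a optimal /dev/sdb mkpart primary 1MiB 2049MiB
>> parted -s -a optimal /dev/sdb mkpart primary 2049MiB 4097MiB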
>>
>> Wido
>>
>> > One final related question: if I were to use method #1 (which I would
>> > prefer if there is no material performance or other reason to use #2),
>> > then that SSD disk reference (i.e. the "osd journal = /dev/sdb") would
>> > have to be identical on all other hardware nodes, yes? (I want to use the
>> > same ceph.conf file on all servers per the doco recommendations.) What
>> > would happen if, for example, the SSD was on /dev/sde on a new node added
>> > into the cluster? References to /dev/disk/by-id etc are clearly no help,
>> > so should a symlink be used from the get-go? E.g. something like
>> > "ln -s /dev/sdb /srv/ssd" on one box and "ln -s /dev/sde /srv/ssd" on the
>> > other box, so that in the [osd] section we could use a line which would
>> > find the SSD disk on all nodes: "osd journal = /srv/ssd"?
>> >
>> > Many thanks for any advice provided.
>> >
>> > Cheers
>> >
>> > Paul
>> >
>> >
>> >
