On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> And secondly, should the SSD journal sizes be large or small? I.e., is,
> say, a 1GB partition per paired 2-3TB SATA disk OK? Or as large an SSD
> as possible? There are many forum posts that say 100-200MB will suffice.
> A quick piece of advice will hopefully save us several days of
> reconfiguring and benchmarking the cluster :-)

ceph-osd will periodically do a 'commit' to ensure that what is in the journal
has been written safely to the file system. On btrfs that's a snapshot; on
anything else it's a sync(2). When the journal hits 50% full we trigger a
commit, or when a timer expires (I think 30 seconds by default). There is some
overhead associated with the sync/snapshot, so committing less often is
generally better.

A decent rule of thumb is probably to make the journal big enough to absorb
sustained writes for 10-30 seconds. On modern disks, that's probably 1-3GB?
If the journal is on the same spindle as the fs, it'll probably be half
that... </hand waving>

sage
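For illustration, a minimal ceph.conf sketch of that rule of thumb, assuming a
data disk that sustains roughly 100 MB/s (device names are hypothetical; "osd
journal size" is given in MB and only matters for a file-based journal, since a
raw partition is simply used at whatever size it already is):

    [osd]
        # ~100 MB/s sustained writes x ~20 s between commits = about 2 GB.
        # Only consulted when the journal is a plain file.
        osd journal size = 2048

    [osd.0]
        host = ceph1
        btrfs devs = /dev/sdc
        # ~2 GB journal partition on the shared SSD (hypothetical device)
        osd journal = /dev/sdb5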
> Thanks
> Paul
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Wido den Hollander
> Sent: Tuesday, 14 February 2012 10:46 PM
> To: Paul Pettigrew
> Cc: [email protected]
> Subject: Re: Which SSD method is better for performance?
>
> Hi,
>
> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> > G'day all
> >
> > About to commence an R&D eval of the Ceph platform, having been impressed
> > with the momentum achieved over the past 12 months.
> >
> > I have one question re design before rolling out to metal...
> >
> > I will be using 1x SSD drive per storage server node (assume it is /dev/sdb
> > for this discussion), and cannot readily determine the pros and cons of the
> > two methods of using it for the OSD journal, being:
> > #1. place it in the main [osd] stanza and reference the whole drive as
> >     a single partition; or
>
> That won't work. If you do that, all OSDs will try to open the same journal.
> The journal for each OSD has to be unique.
>
> > #2. partition up the disk, so 1x partition per SATA HDD, and place
> >     each partition in the [osd.N] portion
>
> That would be your best option.
>
> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
>
> The VG "data" is placed on an SSD (Intel X25-M).
>
> > So if I were to code #1 in the ceph.conf file, it would be:
> >
> > [osd]
> >     osd journal = /dev/sdb
> >
> > Or, #2 would be like:
> >
> > [osd.0]
> >     host = ceph1
> >     btrfs devs = /dev/sdc
> >     osd journal = /dev/sdb5
> > [osd.1]
> >     host = ceph1
> >     btrfs devs = /dev/sdd
> >     osd journal = /dev/sdb6
> > [osd.2]
> >     host = ceph1
> >     btrfs devs = /dev/sde
> >     osd journal = /dev/sdb7
> > [osd.3]
> >     host = ceph1
> >     btrfs devs = /dev/sdf
> >     osd journal = /dev/sdb8
> >
> > I am asking therefore: is the added work (and the constraints) of specifying
> > individual partitions per #2 worth it in performance gains? Does it not also
> > have a constraint, in that if I wanted to add more HDDs to the server (we buy
> > 45-bay units, and typically provision HDDs "on demand", i.e. 15x at a time as
> > usage grows), I would have to additionally partition the SSD (taking it
> > offline), whereas with #1 I would only have to add more [osd.N] sections (and
> > not have to worry about ending up with an SSD carrying 45x partitions)?
>
> You'd still have to go for #2. However, running 45 OSDs on a single machine
> is a bit tricky imho.
>
> If that machine fails you would lose 45 OSDs at once, and that will put a lot
> of stress on the recovery of your cluster.
>
> You'd also need a lot of RAM to accommodate those 45 OSDs; at least 48GB of
> RAM, I guess.
>
> A last note: if you use an SSD for your journaling, make sure that you align
> your partitions with the page size of the SSD, otherwise you'd run into the
> write amplification of the SSD, resulting in a performance loss.
>
> Wido
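For illustration, one way to carve the journal partitions so they honour the
alignment note above, sketched with parted (device names and sizes are
hypothetical). A 1 MiB starting offset is a multiple of any common SSD page
size, which sidesteps the write-amplification penalty Wido describes:

    # Label the SSD and create two 2 GiB journal partitions, each starting
    # on a 1 MiB boundary.
    parted -s -a optimal /dev/sdb mklabel gpt
    parted -s -a optimal /dev/sdb mkpart primary 1MiB 2049MiB
    parted -s -a optimal /dev/sdb mkpart primary 2049MiB 4097MiB

    # Confirm that each new partition really is aligned.
    parted /dev/sdb align-check optimal 1
    parted /dev/sdb align-check optimal 2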
>
> > One final related question: if I were to use method #1 (which I would
> > prefer if there is no material performance or other reason to use #2), then
> > that specification (i.e. the "osd journal = /dev/sdb" SSD disk reference)
> > would have to be identical on all the other hardware nodes, yes (I want to
> > use the same ceph.conf file on all servers per the doco recommendations)?
> > What would happen if, for example, the SSD was on /dev/sde on a new node
> > added into the cluster? References to /dev/disk/by-id etc. are clearly no
> > help, so should a symlink be used from the get-go? E.g. something like
> > "ln -s /dev/sdb /srv/ssd" on one box, and "ln -s /dev/sde /srv/ssd" on the
> > other box, so that in the [osd] section we could use this line, which would
> > find the SSD disk on all nodes: "osd journal = /srv/ssd"?
> >
> > Many thanks for any advice provided.
> >
> > Cheers
> >
> > Paul
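And for the symlink idea in that last question, a sketch adapted to the
per-partition layout Wido recommends (host names, OSD numbers and device names
are hypothetical; per Wido's earlier point each OSD still needs its own unique
journal, hence one stable path per journal partition rather than a single
/srv/ssd):

    # On ceph1, where the SSD enumerates as /dev/sdb (osd.0 and osd.1 live here):
    ln -s /dev/sdb5 /srv/journal-osd.0
    ln -s /dev/sdb6 /srv/journal-osd.1

    # On ceph2, where the same model of SSD happens to come up as /dev/sde
    # (osd.4 and osd.5 live here):
    ln -s /dev/sde5 /srv/journal-osd.4
    ln -s /dev/sde6 /srv/journal-osd.5

    # The shared ceph.conf then never references a raw /dev/sdX name:
    #   [osd.0]
    #       host = ceph1
    #       osd journal = /srv/journal-osd.0
    # The links do have to be re-pointed by hand if a node ever re-enumerates
    # its disks.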
