What are the recommended specs for an SSD used for journaling? It's a bit tricky to move the spinners' journals onto them now, because I already have data on them.

I now have all HDD journals on separate SSDs. The problem is that when I first built the cluster I assigned one journal SSD to 8x 4TB HDDs. Now I see that is too many spinners for one SSD, so I am planning to assign one journal SSD to 4 OSDs, which also gives a bit of extra redundancy (if one journal SSD dies it only takes 4 OSDs down with it, not 8). Do sequential read/write specs matter, or do IOPS matter more? The journal SSDs I have now are, I believe, Intel 520s (240GB, not that great write speeds but high IOPS), and I have a couple of spares of the same type that I can use for journaling. Also, what size should the journal partition be for one 4TB OSD? I now have them set at 5GB (the default ceph-deploy creates).
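For what it's worth, the figure that usually matters for a filestore journal SSD is sustained small-block write throughput under O_DSYNC, not headline IOPS, and consumer drives such as the Intel 520 often do much worse on that test than their datasheets suggest. Below is a minimal sketch of the usual check and of the sizing rule of thumb from the Ceph docs; the device name, the ~100 MB/s per-spinner figure and the config values are illustrative assumptions, not numbers from this thread.

  # DESTRUCTIVE: run only against an SSD that holds no journals or data yet
  # (/dev/sdX is a placeholder device name)
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

  # Sizing rule of thumb from the Ceph docs:
  #   osd journal size >= 2 * (expected throughput * filestore max sync interval)
  # e.g. ~100 MB/s per spinner * 5 s sync interval * 2 = ~1 GB per journal,
  # so the 5 GB ceph-deploy default already has headroom; grow it if you
  # raise the sync interval.
  [osd]
  osd journal size = 10240             # MB, illustrative
  filestore max sync interval = 10     # default is 5

Four journal partitions of that size use only a small slice of a 240GB SSD, and leaving the rest unpartitioned gives the drive extra spare area.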
2016-01-12 21:43 GMT+02:00 Robert LeBlanc <[email protected]>:
> We are using cache tiering in two production clusters at the moment. One cluster is currently running in forward mode due to the excessive promotion/demotion. I've got Nick's patch backported to Hammer and am going through the test suite at the moment. Once it passes, I'll create a PR so it hopefully makes it into the next Hammer version.
>
> In response to your question about journals: once we introduced the SSD cache tier, we moved our spindle journals off the SSDs and onto the spindles. We found that the load on the spindles was a fraction of what it was before the cache tier. When we started up a host (five spindle journals on the same SSD as the cache pool) we would have very long start-up times for the OSDs, because the SSD was a bottleneck on recovery of so many OSDs. We are also finding that even though the Micron M600 drives perform "good enough" under steady state, there isn't as much headroom as there is on the Intel S3610s that we are also evaluating (6-8x less I/O time for the same IOPS on the S3610s compared to the M600s). Being at the limits of the M600 may also contribute to the inability of our busiest production clusters to run in writeback mode permanently.
>
> If your spindle pool is completely fronted by an SSD pool (or at least your busy pools are; we don't front our glance pool, for instance), I'd say leave the configuration simpler and co-locate the journal on the spindle.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
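For reference, the forward mode Robert mentions is just the cache-mode setting on the cache tier pool; a minimal sketch, with hot-pool as a placeholder pool name:

  ceph osd tier cache-mode hot-pool forward
  # and back again once the promotion storm is under control:
  ceph osd tier cache-mode hot-pool writeback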
> On Tue, Jan 12, 2016 at 10:27 AM, Mihai Gheorghe <[email protected]> wrote:
> > One more question. Seeing that the cache tier holds data on it until it reaches the % ratio, I suppose I must set replication to 2 or higher on the cache pool so I don't lose hot data that has not yet been written to the cold storage in case of a drive failure, right?
> >
> > Also, will there be any performance penalty if I put the OSD journal on the same SSD as the OSD? I now have one SSD dedicated to journaling the SSD OSDs. I know that in the case of mechanical drives this is a problem!
> >
> > And thank you for clearing these things up for me.
> >
> > 2016-01-12 18:03 GMT+02:00 Nick Fisk <[email protected]>:
> >> > -----Original Message-----
> >> > From: Mihai Gheorghe [mailto:[email protected]]
> >> > Sent: 12 January 2016 15:42
> >> > To: Nick Fisk <[email protected]>; [email protected]
> >> > Subject: Re: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!
> >> >
> >> > 2016-01-12 17:08 GMT+02:00 Nick Fisk <[email protected]>:
> >> > > -----Original Message-----
> >> > > From: ceph-users [mailto:[email protected]] On Behalf Of Mihai Gheorghe
> >> > > Sent: 12 January 2016 14:56
> >> > > To: Nick Fisk <[email protected]>; [email protected]
> >> > > Subject: Re: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!
> >> > >
> >> > > Thank you very much for the quick answer.
> >> > >
> >> > > I suppose the cache tier works the same way for object storage as well!?
> >> >
> >> > Yes, exactly the same. The cache is actually at the object layer anyway, so it works the same. You can actually pin/unpin objects from the cache as well if you are using it at the object level.
> >> >
> >> > https://github.com/ceph/ceph/pull/6326
> >> >
> >> > > How is a delete of a cinder volume handled? I ask because after the volume got flushed to the cold storage, I then deleted it from cinder. It got deleted from the cache pool as well, but on the HDD pool, when issuing rbd -p ls the volumes were gone while the space was still used (probably rados data) until I manually ran a flush command on the cache pool (I didn't wait long to see if the space would be cleared over time). It is probably a misconfiguration on my end though.
> >> >
> >> > Ah yes, this is one of my pet hates. It's actually slightly worse than what you describe. All the objects have to be promoted into the cache tier to be deleted and then afterwards flushed, to remove them from the base tier as well. For a large image, this can actually take quite a long time. Hopefully this will be fixed at some point; I don't believe it would be too difficult to fix.
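For reference, the manual flush mentioned above and the object pinning Nick links to are both exposed through the rados tool; a minimal sketch with placeholder pool and object names (cache-pin/cache-unpin only exist in releases that carry the pull request linked above):

  # flush and evict everything that can currently be cleaned out of the cache pool
  rados -p hot-pool cache-flush-evict-all

  # pin an object so the tiering agent never flushes/evicts it, then release it
  rados -p hot-pool cache-pin my-object
  rados -p hot-pool cache-unpin my-object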
> >> > I assume this is done automatically and there is no need for a manual flush, only if in a hurry, right?
> >> > What if the image is larger than the whole cache pool? I assume the image will be promoted in smaller objects into the cache pool before deletion.
> >> > I can live with the extra time to delete a volume from the cold storage. My only grudge is the extra network load from the extra step of loading the image into the cache tier to be deleted (the SSD used for the cache pool resides on a different host), as I don't have 10Gb ports, only 1Gb, six of them on every host in LACP mode.
> >>
> >> Yes, this is fine; the objects will just get promoted until the cache is full and then the deleted ones will be flushed out, and so on. The only problem is that it causes cache pollution, as it will force other objects out of the cache. Like I said, it's not the end of the world, but it is very annoying.
> >>
> >> > > In your opinion, is the cache tier ready for production? I have read that bcache (flashcache?) is used in favour of the cache tier, but it is not that simple to set up and there are disadvantages there as well.
> >> >
> >> > See my recent posts about cache tiering; there is a fairly major bug which limits performance if your working set doesn't fit in the cache. Assuming you are running the patch for this bug and you can live with the deletion problem above... then yes, I would say that it is usable in production. I'm planning to enable it on the production pool in my cluster in the next couple of weeks.
> >> >
> >> > I'm sorry, I'm a bit new to the ceph mailing list. Where can I see your recent posts? I really need to check that patch out!
> >>
> >> Here is the patch; it's in master and is in the process of being backported to Hammer. I think for Infernalis you will need to manually patch and build.
> >>
> >> https://github.com/zhouyuan/ceph/commit/8ffb4fba2086f5758a3b260c05d16552e995c452
> >>
> >> > > Also, is there a problem if I add a cache tier to an already existing pool that has data on it? Or should the pool be empty prior to adding the cache tier?
> >> >
> >> > Nope, that should be fine.
> >> >
> >> > I was asking this because I have a 5TB cinder volume with data on it (mostly >3GB in size). I added a cache tier to the pool that holds the volume and I can see chaotic behaviour from my W2012 instance, as in deleting files takes a very long time and not all subdirectories work (I get a directory-not-found error on directories with many small files).
> >>
> >> This could be related to the patch I mentioned. Without it, no matter what the promote recency settings are set to, objects will be promoted on almost every read/write; after the patch, Ceph will obey the settings. The unpatched behaviour can quickly overload the cluster with promotions/evictions, as even small FS reads will cause 4MB promotions.
> >>
> >> So you can set, for example:
> >>
> >> Hit_set_count = 10
> >> Hit_set_period = 60
> >> Read_recency = 3
> >> Write_recency = 5
> >>
> >> This will generate a new hit set every minute and will keep 10 of them. If the last 3 hit sets contain the object then it will be promoted on that read request; if the last 5 hit sets contain the object then it will be promoted on the write request.
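For anyone wanting to try Nick's numbers, they map onto ordinary pool settings applied to the cache pool; a minimal sketch with a placeholder pool name. The read/write recency knobs are called min_read_recency_for_promote and min_write_recency_for_promote, and the write-recency one only exists in releases newer than stock Hammer:

  ceph osd pool set hot-pool hit_set_type bloom
  ceph osd pool set hot-pool hit_set_count 10
  ceph osd pool set hot-pool hit_set_period 60
  ceph osd pool set hot-pool min_read_recency_for_promote 3
  ceph osd pool set hot-pool min_write_recency_for_promote 5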
> >> > > 2016-01-12 16:30 GMT+02:00 Nick Fisk <[email protected]>:
> >> > > > -----Original Message-----
> >> > > > From: ceph-users [mailto:[email protected]] On Behalf Of Mihai Gheorghe
> >> > > > Sent: 12 January 2016 14:25
> >> > > > To: [email protected]
> >> > > > Subject: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!
> >> > > >
> >> > > > Hello,
> >> > > >
> >> > > > I have a question about how the cache tier works with rbd volumes!?
> >> > > >
> >> > > > So I created a pool of SSDs for cache and a pool on HDDs for cold storage that acts as the backend for cinder volumes. I create a volume in cinder from an image and spawn an instance. The volume is created in the cache pool as expected, and as I understand it, it will be flushed to the cold storage after a period of inactivity or after the cache pool reaches 40% full.
> >> > >
> >> > > The cache won't be flushed after inactivity; the cache agent only works on % full (either number of objects or bytes).
> >> > >
> >> > > > Now, after the volume is flushed to the HDDs and I make a read or write request in the guest OS, how does Ceph handle it? Does it pull the whole rbd volume from the cold storage into the cache pool, or only the chunk of it that the guest OS request touches?
> >> > >
> >> > > The cache works on hot objects, so particular objects (normally 4MB) of the RBD will be promoted/demoted over time depending on access patterns.
> >> > >
> >> > > > Also, is replication in Ceph synchronous or asynchronous? If I set a CRUSH rule that uses the SSD host as primary and the HDD host for the replica, would the writes and reads on the SSDs be slowed down by the replication on the mechanical drives? Would this configuration be viable? (I ask this because I don't have enough SSDs to make a pool of size 3 on them.)
> >> > >
> >> > > It's sync replication. If you have a very heavy read workload, you can do what you suggest and set the SSD OSDs to be the primary copy for each PG; writes will still be limited to the speed of the spinning disks, but reads will be serviced from the SSDs. However, there is a risk in degraded scenarios that your performance could drop dramatically if more IO is diverted to the spinning disks.
> >> > >
> >> > > > Thank you!
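To make Nick's SSD-primary suggestion concrete, the usual approach is a CRUSH rule along the lines of the ssd-primary example in the Ceph docs; a sketch, assuming separate ssd and hdd roots already exist in the CRUSH map (the ruleset number and the pool name are placeholders):

  rule ssd-primary {
          ruleset 5
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 1 type host
          step emit
          step take hdd
          step chooseleaf firstn -1 type host
          step emit
  }

  # then point the pool at it:
  # ceph osd pool set volumes crush_ruleset 5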
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
