What are the recommended specs for an SSD used for journaling? It's a bit tricky to move the spinners' journals onto them now, because I already have data on them.

I now have all HDD journals on separate SSDs. The problem is that when I first built the cluster I assigned one journal SSD to 8x 4TB HDDs. Now I see that is too many spinners for one SSD, so I am planning to assign one journal SSD to 4 OSDs, which also gives a bit of extra redundancy (if one journal SSD dies it only takes 4 OSDs down with it, not 8). Do sequential read/write specs matter, or do IOPS matter more? The journal SSDs I have now are, I believe, Intel 520s (240GB, not that great write speeds but high IOPS), and I have a couple of spares of the same type that I can use for journaling. Also, what size should the journal partition be for one 4TB OSD? I now have them set at 5GB (the default ceph-deploy creates).
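For what it's worth, the figure that usually matters for a filestore journal SSD is sustained small-block write throughput under O_DSYNC, not headline IOPS, and consumer drives such as the Intel 520 often do much worse on that test than their datasheets suggest. Below is a minimal sketch of the usual check and of the sizing rule of thumb from the Ceph docs; the device name, the ~100 MB/s per-spinner figure and the config values are illustrative assumptions, not numbers from this thread.

  # DESTRUCTIVE: run only against an SSD that holds no journals or data yet
  # (/dev/sdX is a placeholder device name)
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

  # Sizing rule of thumb from the Ceph docs:
  #   osd journal size >= 2 * (expected throughput * filestore max sync interval)
  # e.g. ~100 MB/s per spinner * 5 s sync interval * 2 = ~1 GB per journal,
  # so the 5 GB ceph-deploy default already has headroom; grow it if you
  # raise the sync interval.
  [osd]
  osd journal size = 10240             # MB, illustrative
  filestore max sync interval = 10     # default is 5

Four journal partitions of that size use only a small slice of a 240GB SSD, and leaving the rest unpartitioned gives the drive extra spare area.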
2016-01-12 21:43 GMT+02:00 Robert LeBlanc <[email protected]>:
> We are using cache tiering in two production clusters at the moment. One cluster is currently running in forward mode due to the excessive promotion/demotion. I've got Nick's patch backported to Hammer and am going through the test suite at the moment. Once it passes, I'll create a PR so it hopefully makes it into the next Hammer version.
>
> In response to your question about journals: once we introduced the SSD cache tier, we moved our spindle journals off the SSDs and onto the spindles. We found that the load on the spindles was a fraction of what it was before the cache tier. When we started up a host (five spindle journals on the same SSD as the cache pool) we would have very long start-up times for the OSDs, because the SSD was a bottleneck on recovery of so many OSDs. We are also finding that even though the Micron M600 drives perform "good enough" under steady state, there isn't as much headroom as there is on the Intel S3610s that we are also evaluating (6-8x less I/O time for the same IOPS on the S3610s compared to the M600s). Being at the limits of the M600 may also contribute to the inability of our busiest production clusters to run in writeback mode permanently.
>
> If your spindle pool is completely fronted by an SSD pool (or at least your busy pools are; we don't front our glance pool, for instance), I'd say leave the configuration simpler and co-locate the journal on the spindle.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
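For reference, the forward mode Robert mentions is just the cache-mode setting on the cache tier pool; a minimal sketch, with hot-pool as a placeholder pool name:

  ceph osd tier cache-mode hot-pool forward
  # and back again once the promotion storm is under control:
  ceph osd tier cache-mode hot-pool writeback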
> On Tue, Jan 12, 2016 at 10:27 AM, Mihai Gheorghe <[email protected]> wrote:
> > One more question. Seeing that the cache tier holds data on it until it reaches the % ratio, I suppose I must set replication to 2 or higher on the cache pool so I don't lose hot data that has not yet been written to the cold storage in case of a drive failure, right?
> >
> > Also, will there be any performance penalty if I put the OSD journal on the same SSD as the OSD? I now have one SSD dedicated to journaling the SSD OSDs. I know that in the case of mechanical drives this is a problem!
> >
> > And thank you for clearing these things up for me.
> >
> > 2016-01-12 18:03 GMT+02:00 Nick Fisk <[email protected]>:
> >> > -----Original Message-----
> >> > From: Mihai Gheorghe [mailto:[email protected]]
> >> > Sent: 12 January 2016 15:42
> >> > To: Nick Fisk <[email protected]>; [email protected]
> >> > Subject: Re: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!
> >> >
> >> > 2016-01-12 17:08 GMT+02:00 Nick Fisk <[email protected]>:
> >> > > -----Original Message-----
> >> > > From: ceph-users [mailto:[email protected]] On Behalf Of Mihai Gheorghe
> >> > > Sent: 12 January 2016 14:56
> >> > > To: Nick Fisk <[email protected]>; [email protected]
> >> > > Subject: Re: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!
> >> > >
> >> > > Thank you very much for the quick answer.
> >> > >
> >> > > I suppose the cache tier works the same way for object storage as well!?
> >> >
> >> > Yes, exactly the same. The cache is actually at the object layer anyway, so it works the same. You can actually pin/unpin objects from the cache as well if you are using it at the object level.
> >> >
> >> > https://github.com/ceph/ceph/pull/6326
> >> >
> >> > > How is a delete of a cinder volume handled? I ask because after the volume got flushed to the cold storage, I then deleted it from cinder. It got deleted from the cache pool as well, but on the HDD pool, when issuing rbd -p ls the volumes were gone while the space was still used (probably rados data) until I manually ran a flush command on the cache pool (I didn't wait long to see if the space would be cleared over time). It is probably a misconfiguration on my end though.
> >> >
> >> > Ah yes, this is one of my pet hates. It's actually slightly worse than what you describe. All the objects have to be promoted into the cache tier to be deleted and then afterwards flushed, to remove them from the base tier as well. For a large image, this can actually take quite a long time. Hopefully this will be fixed at some point; I don't believe it would be too difficult to fix.
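For reference, the manual flush mentioned above and the object pinning Nick links to are both exposed through the rados tool; a minimal sketch with placeholder pool and object names (cache-pin/cache-unpin only exist in releases that carry the pull request linked above):

  # flush and evict everything that can currently be cleaned out of the cache pool
  rados -p hot-pool cache-flush-evict-all

  # pin an object so the tiering agent never flushes/evicts it, then release it
  rados -p hot-pool cache-pin my-object
  rados -p hot-pool cache-unpin my-object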
> >> > I assume this is done automatically and there is no need for a manual flush, only if in a hurry, right?
> >> > What if the image is larger than the whole cache pool? I assume the image will be promoted in smaller objects into the cache pool before deletion.
> >> > I can live with the extra time to delete a volume from the cold storage. My only grudge is the extra network load from the extra step of loading the image into the cache tier to be deleted (the SSD used for the cache pool resides on a different host), as I don't have 10Gb ports, only 1Gb, six of them on every host in LACP mode.
> >>
> >> Yes, this is fine; the objects will just get promoted until the cache is full and then the deleted ones will be flushed out, and so on. The only problem is that it causes cache pollution, as it will force other objects out of the cache. Like I said, it's not the end of the world, but it is very annoying.
> >>
> >> > > In your opinion, is the cache tier ready for production? I have read that bcache (flashcache?) is used in favour of the cache tier, but it is not that simple to set up and there are disadvantages there as well.
> >> >
> >> > See my recent posts about cache tiering; there is a fairly major bug which limits performance if your working set doesn't fit in the cache. Assuming you are running the patch for this bug and you can live with the deletion problem above... then yes, I would say that it is usable in production. I'm planning to enable it on the production pool in my cluster in the next couple of weeks.
> >> >
> >> > I'm sorry, I'm a bit new to the ceph mailing list. Where can I see your recent posts? I really need to check that patch out!
> >>
> >> Here is the patch; it's in master and is in the process of being backported to Hammer. I think for Infernalis you will need to manually patch and build.
> >>
> >> https://github.com/zhouyuan/ceph/commit/8ffb4fba2086f5758a3b260c05d16552e995c452
> >>
> >> > > Also, is there a problem if I add a cache tier to an already existing pool that has data on it? Or should the pool be empty prior to adding the cache tier?
> >> >
> >> > Nope, that should be fine.
> >> >
> >> > I was asking this because I have a 5TB cinder volume with data on it (mostly >3GB in size). I added a cache tier to the pool that holds the volume and I can see chaotic behaviour from my W2012 instance, as in deleting files takes a very long time and not all subdirectories work (I get a directory-not-found error on directories with many small files).
> >>
> >> This could be related to the patch I mentioned. Without it, no matter what the promote recency settings are set to, objects will be promoted on almost every read/write; after the patch, Ceph will obey the settings. The unpatched behaviour can quickly overload the cluster with promotions/evictions, as even small FS reads will cause 4MB promotions.
> >>
> >> So you can set, for example:
> >>
> >> Hit_set_count = 10
> >> Hit_set_period = 60
> >> Read_recency = 3
> >> Write_recency = 5
> >>
> >> This will generate a new hit set every minute and will keep 10 of them. If the last 3 hit sets contain the object then it will be promoted on that read request; if the last 5 hit sets contain the object then it will be promoted on the write request.
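For anyone wanting to try Nick's numbers, they map onto ordinary pool settings applied to the cache pool; a minimal sketch with a placeholder pool name. The read/write recency knobs are called min_read_recency_for_promote and min_write_recency_for_promote, and the write-recency one only exists in releases newer than stock Hammer:

  ceph osd pool set hot-pool hit_set_type bloom
  ceph osd pool set hot-pool hit_set_count 10
  ceph osd pool set hot-pool hit_set_period 60
  ceph osd pool set hot-pool min_read_recency_for_promote 3
  ceph osd pool set hot-pool min_write_recency_for_promote 5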
> >> > > 2016-01-12 16:30 GMT+02:00 Nick Fisk <[email protected]>:
> >> > > > -----Original Message-----
> >> > > > From: ceph-users [mailto:[email protected]] On Behalf Of Mihai Gheorghe
> >> > > > Sent: 12 January 2016 14:25
> >> > > > To: [email protected]
> >> > > > Subject: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!
> >> > > >
> >> > > > Hello,
> >> > > >
> >> > > > I have a question about how the cache tier works with rbd volumes!?
> >> > > >
> >> > > > So I created a pool of SSDs for cache and a pool on HDDs for cold storage that acts as the backend for cinder volumes. I create a volume in cinder from an image and spawn an instance. The volume is created in the cache pool as expected, and as I understand it, it will be flushed to the cold storage after a period of inactivity or after the cache pool reaches 40% full.
> >> > >
> >> > > The cache won't be flushed after inactivity; the cache agent only works on % full (either number of objects or bytes).
> >> > >
> >> > > > Now, after the volume is flushed to the HDDs and I make a read or write request in the guest OS, how does Ceph handle it? Does it pull the whole rbd volume from the cold storage into the cache pool, or only the chunk of it that the guest OS request touches?
> >> > >
> >> > > The cache works on hot objects, so particular objects (normally 4MB) of the RBD will be promoted/demoted over time depending on access patterns.
> >> > >
> >> > > > Also, is replication in Ceph synchronous or asynchronous? If I set a CRUSH rule that uses the SSD host as primary and the HDD host for the replica, would the writes and reads on the SSDs be slowed down by the replication on the mechanical drives? Would this configuration be viable? (I ask this because I don't have enough SSDs to make a pool of size 3 on them.)
> >> > >
> >> > > It's sync replication. If you have a very heavy read workload, you can do what you suggest and set the SSD OSDs to be the primary copy for each PG; writes will still be limited to the speed of the spinning disks, but reads will be serviced from the SSDs. However, there is a risk in degraded scenarios that your performance could drop dramatically if more IO is diverted to the spinning disks.
> >> > >
> >> > > > Thank you!
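To make Nick's SSD-primary suggestion concrete, the usual approach is a CRUSH rule along the lines of the ssd-primary example in the Ceph docs; a sketch, assuming separate ssd and hdd roots already exist in the CRUSH map (the ruleset number and the pool name are placeholders):

  rule ssd-primary {
          ruleset 5
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 1 type host
          step emit
          step take hdd
          step chooseleaf firstn -1 type host
          step emit
  }

  # then point the pool at it:
  # ceph osd pool set volumes crush_ruleset 5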
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
