Also, please search the ML for min_size=1. Curiously, you're running it with
size=3.

On Mon, Jul 3, 2017, 8:33 AM Christian Balzer <[email protected]> wrote:

>
> Hello,
>
> On Mon, 3 Jul 2017 14:18:27 +0200 Mateusz Skała wrote:
>
> > @Christian, thanks for the quick answer; please look below.
> >
> > > -----Original Message-----
> > > From: Christian Balzer [mailto:[email protected]]
> > > Sent: Monday, July 3, 2017 1:39 PM
> > > To: [email protected]
> > > Cc: Mateusz Skała <[email protected]>
> > > Subject: Re: [ceph-users] Cache Tier or any other possibility to
> accelerate
> > > RBD with SSD?
> > >
> > >
> > > Hello,
> > >
> > > On Mon, 3 Jul 2017 13:01:06 +0200 Mateusz Skała wrote:
> > >
> > > > Hello,
> > > >
> > > > We are using cache-tier in Read-forward mode (replica 3) for
> > > > accelerate reads and journals on SSD to accelerate writes.
> > >
> > > OK, lots of things wrong with this statement, but firstly, Ceph
> version (it is
> > > relevant) and more details about your setup and SSDs used would be
> > > interesting and helpful.
> > >
> >
> > Sorry about this. Ceph version 0.92.1, and we plan to upgrade to 10.2.0
> shortly.
>
> I'd never run (in production) one of the short term support versions like
> Kraken, you're not getting ANY bug fixes there at all.
>
> But I guess this means that the dire warning when creating readforward
> cache pools was only added in Jewel.
> The problem with that mode is of course present in all other versions
> that have it.
>
> > About the configuration:
> > 4 nodes, each node with:
> > -  4x HDD WD Re 2TB WD2004FBYZ,
> > -  2x SSD Intel S3610 200GB (one for journal and system with mon, second
> for cache-tier).
> >
> > It gives 32TB of raw HDD space but only 600GB of raw SSD space, and I think
> the problem is the small cache size.
> >
> Don't "think" if you can quantify. Between iostat and the Ceph perf
> counters you can determine how much data goes in and out of your cluster
> and OSDs and how much you'd need to get through a typical day with I/O
> mostly on your cache tier.
>
> That said, 200GB effective cache space is likely to be a bottleneck, yes.
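The iostat/perf-counter check suggested above can be sketched roughly as below. The `op_w_in_bytes` counter name is an assumption here; counter names vary between Ceph releases, so check the actual `perf dump` output on your version.

```shell
# Per-device throughput and utilization on an OSD node, sampled every 60s:
iostat -x 60

# Cumulative client write bytes seen by one OSD since it started
# (counter name may differ on your release -- inspect the full dump):
ceph daemon osd.0 perf dump | jq '.osd.op_w_in_bytes'

# Sample the counter twice, e.g. 24 hours apart, and subtract to get
# bytes/day. If the daily write volume exceeds the usable cache capacity,
# the tier will be flushing and evicting constantly instead of caching.
```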
>
> > > If you had searched the ML archives for readforward you'd come across a
> > > very recent thread by me, in which the powers that be state that this
> mode is
> > > dangerous and not recommended.
> > > During quite some testing with this mode I never encountered any
> problems,
> > > but consider yourself warned.
> > >
> > > Now readforward will FORWARD reads to the backing storage, so it will
> > > NEVER accelerate reads (promote them to the cache-tier).
> > > The only speedup you will see is for objects that have been previously
> > > written and are still in the cache-tier.
> > >
> >
> > ceph osd pool ls detail
> > pool 4 'ssd' replicated size 3 min_size 1 crush_ruleset 1 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 88643 flags
> hashpspool,incomplete_clones tier_of 5 cache_mode readforward target_bytes
> 176093659136 hit_set bloom{false_positive_probability: 0.05, target_size:
> 0, seed: 0} 120s x6 min_read_recency_for_promote 1
> min_write_recency_for_promote 1 stripe_width 0
> >         removed_snaps
> [1~14d,150~27,178~8,183~8,18c~12,1a0~22,1c4~4,1c9~1b]
> > pool 5 'sata' replicated size 3 min_size 1 crush_ruleset 2 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 88643 lfor 66807 flags
> hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 0
> >         removed_snaps
> [1~14d,150~27,178~8,183~8,18c~12,1a0~22,1c4~4,1c9~1b]
> >
> > The setup is over a year old. In ceph status I see flushing, promote and
> evict operations. Maybe that depends on my old version?
> >
> Nothing to do with your version as far as vulnerability to the problem is
> concerned.
>
> And you see all the flushing etc because WRITES are going through your
> cache-tier of course, as I stated above.
> However if your goal is to cache reads, this is the wrong mode and in
> general probably a bad fit for a small cache-tier.
>
>
> Christian
>
> > > Using cache-tiers can work beautifully if you understand the I/O
> patterns
> > > involved (tricky on a cloud storage with very mixed clients), can make
> your
> > > cache-tier large enough to cover the hot objects (working set) or at
> least (as
> > > you are attempting) to segregate the read and write paths as much as
> > > possible.
> > >
> > Have you got any good method to analyze the workload?
> > I found this script https://github.com/cernceph/ceph-scripts and tried to
> look at reads and writes per length, but how can I tell whether the I/O is
> random or sequential?
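One crude way to answer the random-vs-sequential question: if you can extract a trace of (offset, length) pairs per device or image (from blktrace, OSD op logs, or similar), a request is "sequential" when it starts where the previous one ended. A minimal sketch, with a tiny hypothetical trace generated inline:

```shell
# Classify an I/O trace as sequential vs random.
# Input format assumed: one request per line, "<offset> <length>" in bytes.
# The printf here just fabricates four example requests; feed in a real
# trace in practice.
printf '0 4096\n4096 4096\n8192 4096\n1048576 4096\n' |
awk '
  NR > 1 {
    # sequential if this request starts where the previous one ended
    if ($1 == prev_end) seq++; else rnd++
  }
  { prev_end = $1 + $2 }
  END { printf "sequential: %d  random: %d\n", seq, rnd }
'
# -> sequential: 2  random: 1
```

This ignores near-sequential patterns (small gaps, queue reordering), but it gives a first-order split that is usually enough to decide whether a cache tier will see cache-friendly I/O.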
> >
> > > > We are using only RBD. Based
> > > > on the ceph-docs, RBD have bad I/O pattern for cache tier.  I'm
> > > > looking for information about other possibility to accelerate reads
> on
> > > > RBD with SSD drives.
> > > >
> > > The documentation rightly warns about things, so people don't have
> > > unrealistic expectations. However YOU need to look at YOUR loads,
> patterns
> > > and usage and then decide if it is beneficial or not.
> > >
> > > As I hinted above, analyze your systems, are the reads actually slow
> or are
> > > they slowed down by competing writes to the same storage?
> > >
> > > Cold reads (OSD server just rebooted, no cache has that object in it)
> will
> > > obviously not benefit from any scheme.
> > >
> > > Reads from the HDD OSDs can very much benefit by having enough RAM to
> > > hold all the SLAB objects (direntry etc) in memory, so you can avoid
> disk
> > > access to actually find the object.
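To check whether dentry/inode metadata is actually being kept in RAM on an OSD node, something like the following can help (needs root; the slab cache names shown are the usual Linux ones, adjust for your filesystem):

```shell
# How many dentry/inode objects are currently cached in kernel slabs:
grep -E '^(dentry|xfs_inode|ext4_inode_cache)' /proc/slabinfo

# Bias the kernel toward keeping dentries/inodes rather than reclaiming
# them under memory pressure (default 100; lower = reclaim less):
sysctl vm.vfs_cache_pressure=10
```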
> > >
> > > Speeding up the actual data read you have the option of the cache-tier
> (in
> > > writeback mode, with proper promotion and retention configuration).
> > >
> > > Or something like bcache on the OSD servers, discussed here several
> times
> > > as well.
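For reference, switching an existing tier like the one in this thread to writeback with explicit sizing and promotion settings looks roughly like the sketch below. Pool names are taken from the `pool ls detail` output earlier in the thread; the byte and ratio values are placeholders that must be sized for the actual SSDs, and `set-overlay` may already be in place here since the sata pool shows a write_tier.

```shell
ceph osd tier cache-mode ssd writeback
ceph osd tier set-overlay sata ssd                    # route client I/O via the tier
ceph osd pool set ssd hit_set_type bloom
ceph osd pool set ssd target_max_bytes 150000000000   # placeholder: stay below pool capacity
ceph osd pool set ssd cache_target_dirty_ratio 0.4    # start flushing at 40% dirty
ceph osd pool set ssd cache_target_full_ratio 0.8     # start evicting at 80% full
ceph osd pool set ssd min_read_recency_for_promote 2  # promote only repeat reads
ceph osd pool set ssd min_write_recency_for_promote 1
```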
> > >
> > > > The second question, is it any cache tier mode, that replica can be
> > > > set on 1, for best use of SSD space?
> > > >
> > > A cache-tier (the same true for any other real cache methods) will at
> any
> > > given time have objects in it that are NOT on the actual backing
> storage when
> > > it is used to cache writes.
> > > So it needs to be just as redundant as the rest of the system, at
> least a replica
> > > of 2 with sufficiently small/fast SSDs.
> > >
> >
> > OK, I understand.
> >
> > > With bcache etc just caching reads, you can get away with a single
> replication
> > > of course, however failing SSDs may then cause your cluster to melt
> down.
> > >
> >
> > I will search ML for this.
> >
> > > Christian
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > [email protected]       Rakuten Communications
> >
> >
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Rakuten Communications
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
