Thanks for the reply, Sage; please ignore the same-subject mails on ceph-users,
they seem to have been delivered today.

> Hmm, we could have a 'noagent' option (similar to noout, nobackfill, noscrub, 
> etc.) that lets the admin tell the system to stop tiering movements, but I'm 
> not sure that's what you're asking for...

I was not aware of the 'notieragent' flag, but I was hinting at a flow-control
type of mechanism that would throttle client IOs against the service time of
the tiering agent's flush/evict work.
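
For what it's worth, the closest knobs I have found are the per-pool tiering
targets, which shape when the agent starts working rather than pacing it
against client load. A sketch, assuming firefly-era option names and a cache
pool called 'cachepool':

    # start flushing dirty objects once 40% of the cache is dirty
    ceph osd pool set cachepool cache_target_dirty_ratio 0.4
    # don't flush objects written within the last 10 minutes
    ceph osd pool set cachepool cache_min_flush_age 600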

Thanks,
-Pavan.

-----Original Message-----
From: Sage Weil [mailto:[email protected]]
Sent: Tuesday, January 13, 2015 7:31 PM
To: Pavan Rallabhandi
Cc: Ceph Development
Subject: Re: Cache pool latency impact

On Tue, 13 Jan 2015, Pavan Rallabhandi wrote:
> Hi,
>
> This is regarding cache pools and the impact of the flush/evict on the
> client IO latencies.
>
> I am seeing a direct impact on the client IO latencies (making them
> worse) when flush/evict is triggered on the cache pool. Under a
> constant ingress of IOs on the cache pool, the write performance is no
> better than without the cache pool, because it is limited to the speed
> at which objects can be flushed/evicted to the backend pool.

Yeah, this is always going to be true in general.  It is a lot more work to
write into the cache, read it back, write it again into the base pool, and then
delete it from the cache than it is to write directly to the base pool.
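
Roughly, ignoring replication (which multiplies each step), the op counts
look like this:

    direct write:      1 write  -> base pool
    through the cache: 1 write  -> cache pool
                       1 read   <- cache pool (agent flush reads it back)
                       1 write  -> base pool  (agent flush)
                       1 delete -> cache pool (agent evict)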

> The questions I have are:
>
> 1) When the flush/evict is in progress, are the writes on the cache
> pool blocked, either at the PG or at object granularity? I do see a
> blocking flag honored per object context in
> ReplicatedPG::start_flush(), though most of the callers seem to set the
> flag to false.

Normally they are not blocked.  The agent starts working (finding objects to
flush or evict) long before we hit the cutoff where it starts blocking.  Once
it does hit that threshold, though, things can get slow, because new cache
creates aren't allowed until some eviction completes.  You don't want to be in
this situation.  :)
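
If you want to stay away from that threshold, you can give the agent more
headroom so eviction starts earlier. A sketch, again assuming firefly-era
option names and a hypothetical pool called 'cachepool':

    # tell the agent how big the cache is allowed to get (1 TiB here)...
    ceph osd pool set cachepool target_max_bytes 1099511627776
    # ...and start evicting at 80% of that, before writes would block
    ceph osd pool set cachepool cache_target_full_ratio 0.8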

In general, if you have a lot of data ingest, caching (at least in
firefly) isn't a terribly good idea.  The exception would probably be when you
have a high skew toward recent data (say you are ingesting market data, and do
tons of analytics on the last 24 hours, but then the data gets colder).

I can't tell if you're in the situation where the cache pool is full and the
agent is flushing/evicting anything and everything while writes are crawling
(you should see a message in 'ceph health' when this happens), or the one
where the agent is alive but working with low effort and the impact is still
high.  If it's the latter I'm not sure yet what is going wrong...
perhaps you can capture a few minutes of log from one of your OSDs?
(debug ms = 1, debug osd = 20).
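
You can bump those log levels at runtime with injectargs; something like this
against one of the busy OSDs (osd.N is a placeholder):

    ceph tell osd.N injectargs '--debug-ms 1 --debug-osd 20'
    # and back down afterwards, since these logs get large quickly:
    ceph tell osd.N injectargs '--debug-ms 0/5 --debug-osd 0/5'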

> 2) Is there any mechanism (that I might have overlooked) to avoid this
> situation, by throttling the flush/evict operations on the fly? If
> not, shouldn't there be one?

Hmm, we could have a 'noagent' option (similar to noout, nobackfill, noscrub,
etc.) that lets the admin tell the system to stop tiering movements, but I'm
not sure that's what you're asking for...
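
For comparison, the existing cluster flags are toggled like this; a 'noagent'
flag (which doesn't exist today) would presumably follow the same pattern:

    ceph osd set noout       # existing flags work this way
    ceph osd unset noout
    # hypothetical: ceph osd set noagent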

sage
