On Tue, 29 Jul 2014, Wang, Zhiqiang wrote:
> Thanks for the review.
>
> I have one question for the comment "move the hit_set check into
> maybe_handle_cache". The current code inserts 'oid' into the hit set
> before calling maybe_handle_cache. If 'oid' is the same as 'missing_oid',
> and we move the hit_set check into maybe_handle_cache, we'll always see
> this 'oid' in the in-memory hit sets, and never redirect the 1st read.
> That's the reason why I added the hit_set check before the insert.
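To make the ordering concrete, here is a minimal sketch of the check-before-insert logic described above, with simplified placeholder types (HitSetSketch, handle_read); it is not the actual ReplicatedPG code:

    #include <set>
    #include <string>

    struct HitSetSketch {
      // Stand-in for the real HitSet (which is a bloom filter, not a std::set).
      std::set<std::string> objs;
      bool contains(const std::string& oid) const { return objs.count(oid) > 0; }
      void insert(const std::string& oid) { objs.insert(oid); }
    };

    // Returns true if the read should be redirected to the base tier,
    // false if the object should be promoted into the cache tier.
    bool handle_read(HitSetSketch& hit_set, const std::string& oid) {
      // Check for a previous hit *before* recording the current access.
      // If the check ran after the insert (e.g. inside maybe_handle_cache,
      // with the insert already done by the caller), the object would always
      // look "seen before" and the 1st read would never be redirected.
      bool seen_before = hit_set.contains(oid);
      hit_set.insert(oid);     // record this access
      return !seen_before;     // 1st read: redirect; 2nd read: promote
    }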
Ah, yeah, that makes sense!

sage

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Sage Weil
> Sent: Tuesday, July 29, 2014 4:00 AM
> To: Wang, Zhiqiang
> Cc: Zhang, Jian; '[email protected]'; '[email protected]';
> '[email protected]'
> Subject: RE: Cache tiering read-proxy mode
>
> On Mon, 28 Jul 2014, Wang, Zhiqiang wrote:
> > Hi Sage,
> >
> > I made this change in
> > https://github.com/wonzhq/ceph/commit/924e418abb831338e2df7f4a4ec9409b02ee5524
> > and unit tested it. Could you review it and give comments? Thanks.
>
> I made a few comments on the commit on github. Overall it looks good, but
> we should add a test to ceph_test_rados_api_tier (test/librados/tier.cc).
>
> Thanks!
> sage
>
> > -----Original Message-----
> > From: Wang, Zhiqiang
> > Sent: Tuesday, July 22, 2014 9:38 AM
> > To: Sage Weil
> > Cc: Zhang, Jian; [email protected]; [email protected];
> > [email protected]
> > Subject: RE: Cache tiering read-proxy mode
> >
> > Since we can't be accurate at the seconds level, how about making the
> > min_read_recency_for_promote option the number of 'hit set intervals'
> > instead of the number of seconds? So that, when
> > min_read_recency_for_promote is
> > 1) 0: promotion on first read
> > 2) 1: promotion on second read, checking only the current hit set
> > 3) any other number: promotion on second read, keeping this number
> > (including the current one) of hit sets in memory, and checking object
> > existence in these hit sets regardless of hit set rotation
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: Monday, July 21, 2014 10:20 PM
> > To: Wang, Zhiqiang
> > Cc: Zhang, Jian; [email protected]; [email protected];
> > [email protected]
> > Subject: RE: Cache tiering read-proxy mode
> >
> > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > In the current code, when the evict mode is idle, we just keep the
> > > current hit set in memory. All the other hit sets (hit_set_count-1)
> > > are on disk. And when the evict mode is not idle, all the hit sets
> > > are loaded into memory. When the current hit set is full or exceeds
> > > its interval, it is persisted to disk. A new hit set is created to
> > > act as the current one, and the oldest is removed from disk.
> > >
> > > So, if we introduce the min_read_recency_for_promote option, say the
> > > user sets its value to 200 and the 'hit set interval' to 60, does it
> > > mean we need to always keep the 200/60+1=4 latest hit sets in memory
> > > (assuming 'hit set count' is greater than 4; 'hit set count' of them
> > > if not), even if the evict mode is idle? And when persisting the
> > > current hit set, it is still kept in memory, but the oldest in-memory
> > > hit set is removed from memory?
> >
> > Exactly. We can probably just make the helper that loads these into
> > memory for the tiering agent sufficiently generic (if it isn't already)
> > so that it keeps the right number of them in memory when the agent is
> > inactive.
> >
> > > Btw, I don't quite get what you said on the normal hit set rotation part.
> >
> > If we set the tunable to, say, one hour, and the HitSet interval is also
> > an hour, then does this mean we always have 2 HitSets in RAM, so that we
> > cover *at least* an hour while the newest is being populated? If we
> > decide to check the first and second HitSets, then we are actually
> > covering up to double the configured period.
> >
> > sage
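A rough sketch of the 0/1/N semantics proposed above, assuming the retained hit sets are held newest-first in a deque; the names and types are illustrative, not the real Ceph identifiers:

    #include <algorithm>
    #include <deque>
    #include <set>
    #include <string>

    using HitSet = std::set<std::string>;  // placeholder for the real HitSet

    // hit_sets[0] is the current (newest) hit set.
    bool should_promote_on_read(const std::deque<HitSet>& hit_sets,
                                const std::string& oid,
                                unsigned min_read_recency_for_promote) {
      // 0: promote on the first read, no hit set check at all.
      if (min_read_recency_for_promote == 0)
        return true;

      // 1: promote on the second read, checking only the current hit set.
      // N: check the N most recent hit sets (including the current one),
      //    regardless of where we are in the hit set rotation.
      size_t n = std::min<size_t>(min_read_recency_for_promote, hit_sets.size());
      for (size_t i = 0; i < n; ++i)
        if (hit_sets[i].count(oid))
          return true;   // already seen within the recency window
      return false;      // first read in the window: redirect instead
    }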
> > > -----Original Message-----
> > > From: Sage Weil [mailto:[email protected]]
> > > Sent: Monday, July 21, 2014 11:55 AM
> > > To: Wang, Zhiqiang
> > > Cc: Zhang, Jian; [email protected]; [email protected];
> > > [email protected]
> > > Subject: RE: Cache tiering read-proxy mode
> > >
> > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > For the min_read_recency_for_promote option, it's easy to
> > > > understand the '0' and '<= hit set interval' cases. But for the
> > > > '> hit set interval' case, do you mean we always keep all the hit
> > > > sets in RAM and check for the object's existence in all of them, or
> > > > just load all the hit sets and check for object existence before
> > > > the read? In other words, when min_read_recency_for_promote is
> > > > greater than 'hit set interval', do we always keep all the hit sets
> > > > in RAM?
> > >
> > > I'm thinking we would keep as many HitSets as are needed to cover
> > > whatever the configured interval is. Setting the option to the same
> > > value as the hitset interval (or just '1'?) would be the simplest
> > > thing, and probably the default?
> > >
> > > We would need to decide what behavior we want with respect to the
> > > normal HitSet rotation, though. If they each cover, say, one hour,
> > > then on average they will cover half of that, and sometimes almost no
> > > time at all (if they just rotated). So we'd probably want to keep the
> > > next-most-recent one in memory for some period? It'll always be a bit
> > > imprecise, but hopefully it won't really matter...
> > >
> > > sage
> > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:[email protected]]
> > > > Sent: Monday, July 21, 2014 9:44 AM
> > > > To: Wang, Zhiqiang
> > > > Cc: Zhang, Jian; [email protected]; [email protected];
> > > > [email protected]
> > > > Subject: RE: Cache tiering read-proxy mode
> > > >
> > > > [Adding ceph-devel]
> > > >
> > > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > > Sage,
> > > > >
> > > > > I agree with you that promotion on the 2nd read could improve
> > > > > cache tiering's performance for some kinds of workloads. The
> > > > > general idea here is to implement some kind of policy in the
> > > > > cache tier to measure the warmness of the data. If the cache
> > > > > tier is aware of the data warmness, it could even initiate data
> > > > > movement between the cache tier and the base tier. This means
> > > > > data could be prefetched into the cache tier before reading or
> > > > > writing. But I think this is something we could do in the future.
> > > >
> > > > Yeah. I suspect it will be challenging to put this sort of
> > > > prefetching intelligence directly into the OSDs, though. It could
> > > > possibly be done by an external agent, maybe, or could be driven by
> > > > explicit hints from clients ("I will probably access this data
> > > > soon").
> > > >
> > > > > The 'promotion on 2nd read' policy is straightforward. Sure it
> > > > > will benefit some kinds of workloads, but not all. If it is
> > > > > implemented as a cache tier option, the user needs to decide
> > > > > whether to turn it on or not. But I'm afraid most users won't
> > > > > have any idea about this. This increases the difficulty of using
> > > > > cache tiering.
> > > >
> > > > I suspect the 2nd read behavior will be something we'll want to do
> > > > by default... but yeah, there will be a new pool option (or options)
> > > > that controls the behavior.
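As a worked example of the sizing math from earlier in the thread (a 200-second window over 60-second hit set intervals gives 200/60+1 = 4 sets), a hypothetical helper could compute how many hit sets to pin in memory:

    #include <algorithm>
    #include <cstdio>

    // Hypothetical helper: how many of the newest hit sets to keep in RAM so
    // that at least `recency_secs` of history is covered even while the newest
    // hit set is still being populated (hence the +1), capped by hit_set_count.
    unsigned hit_sets_to_keep(unsigned recency_secs, unsigned hit_set_period,
                              unsigned hit_set_count) {
      unsigned needed = recency_secs / hit_set_period + 1;  // 200/60 + 1 = 4
      return std::min(needed, hit_set_count);
    }

    int main() {
      // Matches the example above: recency 200s, interval 60s, count > 4.
      std::printf("%u\n", hit_sets_to_keep(200, 60, 8));    // prints 4
    }

Keeping K sets of period T covers somewhere between (K-1)*T and K*T seconds of history depending on how recently the current set rotated, which is the imprecision discussed above.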
> > > > > One question for the implementation of 'promotion on 2nd read':
> > > > > what do we do for the 1st read? Does the cache tier read the
> > > > > object from the base tier but not do replication, or just
> > > > > redirect it?
> > > >
> > > > For the first read, we just redirect the client. Then on the second
> > > > read, we call promote_object(). See maybe_handle_cache() in
> > > > ReplicatedPG.cc. We can pretty easily tell the difference by
> > > > checking the in-memory HitSet for a match.
> > > >
> > > > Perhaps the option in the pool would be something like
> > > > min_read_recency_for_promote? If we measure "recency" as "(avg)
> > > > seconds since last access" (loosely), 0 would mean promote on the
> > > > first read, and anything <= the HitSet interval would mean promote
> > > > if the object is in the current HitSet. Anything greater than that
> > > > would mean we'd need to keep additional previous HitSets in RAM.
> > > >
> > > > ...which leads us to a separate question of how to describe access
> > > > frequency vs recency. We keep N HitSets, each covering a time
> > > > period of T seconds. Normally we only keep the most recent HitSet
> > > > in memory, unless the agent is active (flushing data). So what I
> > > > described above is checking how recently the last access was
> > > > (within how many multiples of T seconds). Additionally, though, we
> > > > could describe the frequency of access: was the object accessed at
> > > > least once in every one of the N intervals of T seconds? Or some
> > > > fraction of them? That is probably best described as
> > > > "temperature"? I'm not too fond of the term "recency," though I
> > > > can't think of anything better right now.
> > > >
> > > > Anyway, for the read promote behavior, recency is probably
> > > > sufficient, but for the tiering agent flush/evict behavior,
> > > > temperature might be a good thing to consider...
> > > >
> > > > sage
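To illustrate the recency-vs-temperature distinction in code (purely speculative naming, not an implemented Ceph interface):

    #include <deque>
    #include <set>
    #include <string>

    using HitSet = std::set<std::string>;  // placeholder for the real HitSet

    // Recency: index of the newest hit set containing oid (0 = current
    // interval, 1 = previous interval, ...), or -1 if not seen at all.
    int read_recency(const std::deque<HitSet>& hit_sets, const std::string& oid) {
      for (size_t i = 0; i < hit_sets.size(); ++i)
        if (hit_sets[i].count(oid))
          return static_cast<int>(i);
      return -1;
    }

    // Temperature: fraction of the retained intervals in which oid was
    // accessed at least once (1.0 = touched in every interval).
    double temperature(const std::deque<HitSet>& hit_sets, const std::string& oid) {
      if (hit_sets.empty())
        return 0.0;
      size_t hits = 0;
      for (const HitSet& hs : hit_sets)
        if (hs.count(oid))
          ++hits;
      return static_cast<double>(hits) / hit_sets.size();
    }

The read-promote path would consult something like read_recency(), while the tiering agent's flush/evict decisions might weigh temperature().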
