On Mon, 28 Jul 2014, Wang, Zhiqiang wrote:
> Hi Sage,
>
> I made this change in
> https://github.com/wonzhq/ceph/commit/924e418abb831338e2df7f4a4ec9409b02ee5524
> and unit tested it. Could you review it and give comments? Thanks.
I made a few comments on the commit on GitHub. Overall it looks good, but
we should add a test to ceph_test_rados_api_tier (test/librados/tier.cc).
Thanks!

sage

> -----Original Message-----
> From: Wang, Zhiqiang
> Sent: Tuesday, July 22, 2014 9:38 AM
> To: Sage Weil
> Cc: Zhang, Jian; [email protected]; [email protected]; [email protected]
> Subject: RE: Cache tiering read-proxy mode
>
> Since we can't be accurate at the seconds level, how about making the
> min_read_recency_for_promote option the number of 'hit set intervals'
> instead of a number of seconds? So that, when min_read_recency_for_promote is:
> 1) 0: promote on the first read
> 2) 1: promote on the second read, checking only the current hit set
> 3) any other number N: promote on the second read, keeping N hit sets
>    (including the current one) in memory and checking for the object in
>    these hit sets regardless of hit set rotation
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: Monday, July 21, 2014 10:20 PM
> To: Wang, Zhiqiang
> Cc: Zhang, Jian; [email protected]; [email protected]; [email protected]
> Subject: RE: Cache tiering read-proxy mode
>
> On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > In the current code, when the evict mode is idle, we just keep the
> > current hit set in memory. All the other hit sets (hit_set_count-1)
> > are on disk. When the evict mode is not idle, all the hit sets are
> > loaded into memory. When the current hit set is full or exceeds its
> > interval, it is persisted to disk. A new hit set is created to act as
> > the current one, and the oldest is removed from disk.
> >
> > So, if we introduce the min_read_recency_for_promote option, say the
> > user sets its value to 200 and the 'hit set interval' to 60, does it
> > mean we always need to keep the 200/60+1=4 latest hit sets in memory
> > (assuming 'hit set count' is greater than 4; 'hit set count' of them
> > if not), even if the evict mode is idle? And when persisting the
> > current hit set, it is still kept in memory, but the oldest in-memory
> > hit set is removed from memory?
>
> Exactly. We can probably just make the helper that loads these into memory
> for the tiering agent sufficiently generic (if it isn't already) so that it
> keeps the right number of them in memory when the agent is inactive.
>
> > Btw, I don't quite get what you said about the normal hit set rotation part.
>
> If we set the tunable to, say, one hour, and the HitSet interval is also an
> hour, then does this mean we always have 2 HitSets in RAM, so that we cover
> *at least* an hour while the newest is being populated? If we decide to
> check the first and second HitSets, then we are actually covering up to
> double the configured period.
>
> sage
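[To make the bookkeeping being discussed above concrete, here is a minimal
sketch of keeping the newest few hit sets resident and dropping the oldest
when a new interval starts. All names (HitSet, HitSetWindow, rotate, etc.) are
made up for illustration; the OSD's real HitSet is a probabilistic structure,
and the actual logic lives in the hit_set_* code in ReplicatedPG.cc.]

    #include <deque>
    #include <memory>
    #include <string>
    #include <unordered_set>

    // Illustrative stand-in for the OSD's HitSet (really e.g. a bloom filter).
    struct HitSet {
      std::unordered_set<std::string> objs;
      void insert(const std::string &oid) { objs.insert(oid); }
      bool contains(const std::string &oid) const { return objs.count(oid) > 0; }
    };

    struct HitSetWindow {
      size_t max_in_memory;                        // e.g. 200/60+1 = 4 in the example above
      std::deque<std::shared_ptr<HitSet>> sets;    // front = current, back = oldest

      explicit HitSetWindow(size_t n) : max_in_memory(n) {
        sets.push_front(std::make_shared<HitSet>());
      }

      // Called when the current hit set is persisted and a new interval starts.
      void rotate() {
        sets.push_front(std::make_shared<HitSet>());
        while (sets.size() > max_in_memory)
          sets.pop_back();                         // drop the oldest in-memory hit set
      }

      void record_access(const std::string &oid) { sets.front()->insert(oid); }

      // Was the object seen in any of the in-memory hit sets?
      bool seen_recently(const std::string &oid) const {
        for (const auto &hs : sets)
          if (hs->contains(oid))
            return true;
        return false;
      }
    };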
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: Monday, July 21, 2014 11:55 AM
> > To: Wang, Zhiqiang
> > Cc: Zhang, Jian; [email protected]; [email protected]; [email protected]
> > Subject: RE: Cache tiering read-proxy mode
> >
> > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > For the min_read_recency_for_promote option, it's easy to understand
> > > the '0' and '<= hit set interval' cases. But for the '> hit set interval'
> > > case, do you mean we always keep all the hit sets in RAM and check
> > > for the object's existence in all of them, or just load all the hit
> > > sets and check for object existence before the read? In other words,
> > > when min_read_recency_for_promote is greater than 'hit set interval',
> > > do we always keep all the hit sets in RAM?
> >
> > I'm thinking we would keep as many HitSets as are needed to cover whatever
> > the configured interval is. Setting the option to the same value as the
> > hitset interval (or just '1'?) would be the simplest thing, and probably
> > the default?
> >
> > We would need to decide what behavior we want with respect to the normal
> > HitSet rotation, though. If they each cover, say, one hour, then on
> > average they will cover half of that, and sometimes almost no time at all
> > (if they just rotated). So probably we'd want to keep the next-most-recent
> > in memory for some period? It'll always be a bit imprecise, though;
> > hopefully it won't really matter...
> >
> > sage
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:[email protected]]
> > > Sent: Monday, July 21, 2014 9:44 AM
> > > To: Wang, Zhiqiang
> > > Cc: Zhang, Jian; [email protected]; [email protected]; [email protected]
> > > Subject: RE: Cache tiering read-proxy mode
> > >
> > > [Adding ceph-devel]
> > >
> > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > Sage,
> > > >
> > > > I agree with you that promotion on the 2nd read could improve
> > > > cache tiering's performance for some kinds of workloads. The
> > > > general idea here is to implement some kind of policy in the
> > > > cache tier to measure the warmness of the data. If the cache tier
> > > > is aware of the data warmness, it could even initiate data
> > > > movement between the cache tier and the base tier. This means data
> > > > could be prefetched into the cache tier before reading or writing.
> > > > But I think this is something we could do in the future.
> > >
> > > Yeah. I suspect it will be challenging to put this sort of prefetching
> > > intelligence directly into the OSDs, though. It could possibly be done
> > > by an external agent, maybe, or could be driven by explicit hints from
> > > clients ("I will probably access this data soon").
> > >
> > > > The 'promotion on 2nd read' policy is straightforward. Sure it
> > > > will benefit some kinds of workload, but not all. If it is
> > > > implemented as a cache tier option, the user needs to decide whether
> > > > to turn it on. But I'm afraid most users won't know about it, which
> > > > increases the difficulty of using cache tiering.
> > >
> > > I suspect the 2nd read behavior will be something we'll want to do by
> > > default... but yeah, there will be a new pool option (or options) that
> > > controls the behavior.
> > >
> > > > One question for the implementation of 'promotion on 2nd read':
> > > > what do we do for the 1st read? Does the cache tier read the
> > > > object from the base tier but not replicate it, or just redirect it?
> > >
> > > For the first read, we just redirect the client. Then on the second read,
> > > we call promote_object(). See maybe_handle_cache() in ReplicatedPG.cc.
> > > We can pretty easily tell the difference by checking the in-memory HitSet
> > > for a match.
> > >
> > > Perhaps the option in the pool would be something like
> > > min_read_recency_for_promote? If we measure "recency" as "(avg) seconds
> > > since last access" (loosely), 0 would mean it would promote on the first
> > > read, and anything <= the HitSet interval would mean promote if the
> > > object is in the current HitSet. Anything greater than that would mean
> > > we'd need to keep additional previous HitSets in RAM.
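[A sketch of the read-path decision being described here: recency is
min_read_recency_for_promote expressed in hit set intervals, and sets is any
newest-first container of hit sets, for example the HitSetWindow::sets deque
sketched earlier in this thread. Illustrative only; this is not the actual
maybe_handle_cache() logic.]

    // recency == 0: promote on the first read.
    // recency == 1: promote only if the object is in the current hit set.
    // recency == N: also consult the N-1 previous in-memory hit sets.
    template <typename Sets, typename Oid>
    bool should_promote_on_read(const Oid &oid, unsigned recency, const Sets &sets) {
      if (recency == 0)
        return true;                    // always promote, even on the first read
      unsigned checked = 0;
      for (const auto &hs : sets) {
        if (checked++ >= recency)
          break;                        // only look back 'recency' intervals
        if (hs->contains(oid))
          return true;                  // seen recently: this is at least the 2nd read
      }
      return false;                     // first touch: proxy/redirect instead of promoting
    }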
> > > ...which leads us to a separate question of how to describe access
> > > frequency vs recency. We keep N HitSets, each covering a time period
> > > of T seconds. Normally we only keep the most recent HitSet in memory,
> > > unless the agent is active (flushing data). So what I described above
> > > is checking how recently the last access was (within how many multiples
> > > of T seconds). Additionally, though, we could describe the frequency of
> > > access: was the object accessed at least once in each of the last N
> > > intervals of T seconds? Or in some fraction of them? That is probably
> > > best described as "temperature"? I'm not too fond of the term
> > > "recency," though I can't think of anything better right now.
> > >
> > > Anyway, for the read promote behavior, recency is probably sufficient,
> > > but for the tiering agent flush/evict behavior temperature might be a
> > > good thing to consider...
> > >
> > > sage
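[And a sketch of the "temperature" idea above: rather than asking only how
recently an object was last touched, count in how many of the last N intervals
it appeared. Again purely illustrative; this is not an existing Ceph
interface, and it reuses the hit-set container shape from the earlier sketch.]

    // Returns the fraction of the last n_intervals hit sets (newest first)
    // that contain the object: 1.0 means it was hit in every interval,
    // 0.0 means it was not seen at all.
    template <typename Sets, typename Oid>
    double temperature(const Oid &oid, const Sets &sets, unsigned n_intervals) {
      unsigned hits = 0, checked = 0;
      for (const auto &hs : sets) {
        if (checked == n_intervals)
          break;
        ++checked;
        if (hs->contains(oid))
          ++hits;
      }
      return checked ? static_cast<double>(hits) / checked : 0.0;
    }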

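[Finally, coming back to the suggestion at the top of the thread to add a test
to ceph_test_rados_api_tier (test/librados/tier.cc): a very rough standalone
sketch of what such a check could look like. The pool names, the assumption
that the cache tier overlay and min_read_recency_for_promote are already
configured, and the way promotion would be verified are all placeholders; a
real test would use the existing gtest fixtures in tier.cc.]

    #include <rados/librados.hpp>
    #include <cstdio>

    int main() {
      librados::Rados cluster;
      if (cluster.init(NULL) < 0 ||          // default client identity
          cluster.conf_read_file(NULL) < 0 ||
          cluster.connect() < 0)
        return 1;

      librados::IoCtx base;                  // base pool, assumed to have a cache tier overlay
      if (cluster.ioctx_create("base-pool", base) < 0)
        return 1;

      // Assumes "foo" starts out only in the base tier (e.g. written before
      // the cache overlay was set up).
      librados::bufferlist wbl;
      wbl.append("hello", 5);
      if (base.write_full("foo", wbl) < 0)
        return 1;

      // 1st read: with min_read_recency_for_promote > 0 this should be
      // redirected/proxied, so "foo" should not yet be in the cache pool.
      librados::bufferlist rbl;
      if (base.read("foo", rbl, 5, 0) < 0)
        return 1;

      // 2nd read: the object is now in the current HitSet, so it should be
      // promoted.  A real test would then list the cache pool (or check its
      // stats) to confirm "foo" was copied up.
      rbl.clear();
      if (base.read("foo", rbl, 5, 0) < 0)
        return 1;

      std::printf("reads ok; check the cache pool for the promoted object\n");
      cluster.shutdown();
      return 0;
    }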