On Tue, 29 Jul 2014, Wang, Zhiqiang wrote:
> Thanks for the review.
>
> I have one question for the comment "move the hit_set check into
> maybe_handle_cache". The current code inserts 'oid' into the hit set
> before calling maybe_handle_cache. If 'oid' is the same as 'missing_oid',
> and we move the hit_set check into maybe_handle_cache, we'll always see
> this 'oid' in the in-memory hit sets, and never redirect the 1st read.
> That's the reason why I added the hit_set check before the insert.
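To make the ordering concrete, here is a minimal sketch of the check-before-insert logic described above, with simplified placeholder types (HitSetSketch, handle_read); it is not the actual ReplicatedPG code:

    #include <set>
    #include <string>

    struct HitSetSketch {
      // Stand-in for the real HitSet (which is a bloom filter, not a std::set).
      std::set<std::string> objs;
      bool contains(const std::string& oid) const { return objs.count(oid) > 0; }
      void insert(const std::string& oid) { objs.insert(oid); }
    };

    // Returns true if the read should be redirected to the base tier,
    // false if the object should be promoted into the cache tier.
    bool handle_read(HitSetSketch& hit_set, const std::string& oid) {
      // Check for a previous hit *before* recording the current access.
      // If the check ran after the insert (e.g. inside maybe_handle_cache,
      // with the insert already done by the caller), the object would always
      // look "seen before" and the 1st read would never be redirected.
      bool seen_before = hit_set.contains(oid);
      hit_set.insert(oid);     // record this access
      return !seen_before;     // 1st read: redirect; 2nd read: promote
    }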
Ah, yeah, that makes sense!

sage

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Sage Weil
> Sent: Tuesday, July 29, 2014 4:00 AM
> To: Wang, Zhiqiang
> Cc: Zhang, Jian; '[email protected]'; '[email protected]';
> '[email protected]'
> Subject: RE: Cache tiering read-proxy mode
>
> On Mon, 28 Jul 2014, Wang, Zhiqiang wrote:
> > Hi Sage,
> >
> > I made this change in
> > https://github.com/wonzhq/ceph/commit/924e418abb831338e2df7f4a4ec9409b02ee5524
> > and unit tested it. Could you review it and give comments? Thanks.
>
> I made a few comments on the commit on github. Overall it looks good, but
> we should add a test to ceph_test_rados_api_tier (test/librados/tier.cc).
>
> Thanks!
> sage
>
> > -----Original Message-----
> > From: Wang, Zhiqiang
> > Sent: Tuesday, July 22, 2014 9:38 AM
> > To: Sage Weil
> > Cc: Zhang, Jian; [email protected]; [email protected];
> > [email protected]
> > Subject: RE: Cache tiering read-proxy mode
> >
> > Since we can't be accurate at the seconds level, how about making the
> > min_read_recency_for_promote option the number of 'hit set intervals'
> > instead of the number of seconds? So that, when
> > min_read_recency_for_promote is
> > 1) 0: promotion on first read
> > 2) 1: promotion on second read, checking only the current hit set
> > 3) any other number: promotion on second read, keeping this number
> > (including the current one) of hit sets in memory, and checking object
> > existence in these hit sets regardless of hit set rotation
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: Monday, July 21, 2014 10:20 PM
> > To: Wang, Zhiqiang
> > Cc: Zhang, Jian; [email protected]; [email protected];
> > [email protected]
> > Subject: RE: Cache tiering read-proxy mode
> >
> > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > In the current code, when the evict mode is idle, we just keep the
> > > current hit set in memory. All the other hit sets (hit_set_count-1)
> > > are on disk. And when the evict mode is not idle, all the hit sets
> > > are loaded into memory. When the current hit set is full or exceeds
> > > its interval, it is persisted to disk. A new hit set is created to
> > > act as the current one, and the oldest is removed from disk.
> > >
> > > So, if we introduce the min_read_recency_for_promote option, say the
> > > user sets its value to 200 and the 'hit set interval' to 60, does it
> > > mean we need to always keep the 200/60+1=4 latest hit sets in memory
> > > (assuming 'hit set count' is greater than 4; 'hit set count' of them
> > > if not), even if the evict mode is idle? And when persisting the
> > > current hit set, it is still kept in memory, but the oldest in-memory
> > > hit set is removed from memory?
> >
> > Exactly. We can probably just make the helper that loads these into
> > memory for the tiering agent sufficiently generic (if it isn't already)
> > so that it keeps the right number of them in memory when the agent is
> > inactive.
> >
> > > Btw, I don't quite get what you said on the normal hit set rotation part.
> >
> > If we set the tunable to, say, one hour, and the HitSet interval is also
> > an hour, then does this mean we always have 2 HitSets in RAM, so that we
> > cover *at least* an hour while the newest is being populated? If we
> > decide to check the first and second HitSets, then we are actually
> > covering up to double the configured period.
> >
> > sage
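A rough sketch of the 0/1/N semantics proposed above, assuming the retained hit sets are held newest-first in a deque; the names and types are illustrative, not the real Ceph identifiers:

    #include <algorithm>
    #include <deque>
    #include <set>
    #include <string>

    using HitSet = std::set<std::string>;  // placeholder for the real HitSet

    // hit_sets[0] is the current (newest) hit set.
    bool should_promote_on_read(const std::deque<HitSet>& hit_sets,
                                const std::string& oid,
                                unsigned min_read_recency_for_promote) {
      // 0: promote on the first read, no hit set check at all.
      if (min_read_recency_for_promote == 0)
        return true;

      // 1: promote on the second read, checking only the current hit set.
      // N: check the N most recent hit sets (including the current one),
      //    regardless of where we are in the hit set rotation.
      size_t n = std::min<size_t>(min_read_recency_for_promote, hit_sets.size());
      for (size_t i = 0; i < n; ++i)
        if (hit_sets[i].count(oid))
          return true;   // already seen within the recency window
      return false;      // first read in the window: redirect instead
    }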
> > > -----Original Message-----
> > > From: Sage Weil [mailto:[email protected]]
> > > Sent: Monday, July 21, 2014 11:55 AM
> > > To: Wang, Zhiqiang
> > > Cc: Zhang, Jian; [email protected]; [email protected];
> > > [email protected]
> > > Subject: RE: Cache tiering read-proxy mode
> > >
> > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > For the min_read_recency_for_promote option, it's easy to
> > > > understand the '0' and '<= hit set interval' cases. But for the
> > > > '> hit set interval' case, do you mean we always keep all the hit
> > > > sets in RAM and check for the object's existence in all of them, or
> > > > just load all the hit sets and check for object existence before
> > > > the read? In other words, when min_read_recency_for_promote is
> > > > greater than 'hit set interval', do we always keep all the hit sets
> > > > in RAM?
> > >
> > > I'm thinking we would keep as many HitSets as are needed to cover
> > > whatever the configured interval is. Setting the option to the same
> > > value as the hitset interval (or just '1'?) would be the simplest
> > > thing, and probably the default?
> > >
> > > We would need to decide what behavior we want with respect to the
> > > normal HitSet rotation, though. If they each cover, say, one hour,
> > > then on average they will cover half of that, and sometimes almost no
> > > time at all (if they just rotated). So we'd probably want to keep the
> > > next-most-recent one in memory for some period? It'll always be a bit
> > > imprecise, but hopefully it won't really matter...
> > >
> > > sage
> > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:[email protected]]
> > > > Sent: Monday, July 21, 2014 9:44 AM
> > > > To: Wang, Zhiqiang
> > > > Cc: Zhang, Jian; [email protected]; [email protected];
> > > > [email protected]
> > > > Subject: RE: Cache tiering read-proxy mode
> > > >
> > > > [Adding ceph-devel]
> > > >
> > > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > > Sage,
> > > > >
> > > > > I agree with you that promotion on the 2nd read could improve
> > > > > cache tiering's performance for some kinds of workloads. The
> > > > > general idea here is to implement some kind of policy in the
> > > > > cache tier to measure the warmness of the data. If the cache
> > > > > tier is aware of the data warmness, it could even initiate data
> > > > > movement between the cache tier and the base tier. This means
> > > > > data could be prefetched into the cache tier before reading or
> > > > > writing. But I think this is something we could do in the future.
> > > >
> > > > Yeah. I suspect it will be challenging to put this sort of
> > > > prefetching intelligence directly into the OSDs, though. It could
> > > > possibly be done by an external agent, maybe, or could be driven by
> > > > explicit hints from clients ("I will probably access this data
> > > > soon").
> > > >
> > > > > The 'promotion on 2nd read' policy is straightforward. Sure it
> > > > > will benefit some kinds of workloads, but not all. If it is
> > > > > implemented as a cache tier option, the user needs to decide
> > > > > whether to turn it on or not. But I'm afraid most users won't
> > > > > have any idea about this. This increases the difficulty of using
> > > > > cache tiering.
> > > >
> > > > I suspect the 2nd read behavior will be something we'll want to do
> > > > by default... but yeah, there will be a new pool option (or options)
> > > > that controls the behavior.
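As a worked example of the sizing math from earlier in the thread (a 200-second window over 60-second hit set intervals gives 200/60+1 = 4 sets), a hypothetical helper could compute how many hit sets to pin in memory:

    #include <algorithm>
    #include <cstdio>

    // Hypothetical helper: how many of the newest hit sets to keep in RAM so
    // that at least `recency_secs` of history is covered even while the newest
    // hit set is still being populated (hence the +1), capped by hit_set_count.
    unsigned hit_sets_to_keep(unsigned recency_secs, unsigned hit_set_period,
                              unsigned hit_set_count) {
      unsigned needed = recency_secs / hit_set_period + 1;  // 200/60 + 1 = 4
      return std::min(needed, hit_set_count);
    }

    int main() {
      // Matches the example above: recency 200s, interval 60s, count > 4.
      std::printf("%u\n", hit_sets_to_keep(200, 60, 8));    // prints 4
    }

Keeping K sets of period T covers somewhere between (K-1)*T and K*T seconds of history depending on how recently the current set rotated, which is the imprecision discussed above.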
> > > > > One question for the implementation of 'promotion on 2nd read':
> > > > > what do we do for the 1st read? Does the cache tier read the
> > > > > object from the base tier but not do replication, or just
> > > > > redirect it?
> > > >
> > > > For the first read, we just redirect the client. Then on the second
> > > > read, we call promote_object(). See maybe_handle_cache() in
> > > > ReplicatedPG.cc. We can pretty easily tell the difference by
> > > > checking the in-memory HitSet for a match.
> > > >
> > > > Perhaps the option in the pool would be something like
> > > > min_read_recency_for_promote? If we measure "recency" as "(avg)
> > > > seconds since last access" (loosely), 0 would mean promote on the
> > > > first read, and anything <= the HitSet interval would mean promote
> > > > if the object is in the current HitSet. Anything greater than that
> > > > would mean we'd need to keep additional previous HitSets in RAM.
> > > >
> > > > ...which leads us to a separate question of how to describe access
> > > > frequency vs recency. We keep N HitSets, each covering a time
> > > > period of T seconds. Normally we only keep the most recent HitSet
> > > > in memory, unless the agent is active (flushing data). So what I
> > > > described above is checking how recently the last access was
> > > > (within how many multiples of T seconds). Additionally, though, we
> > > > could describe the frequency of access: was the object accessed at
> > > > least once in every one of the N intervals of T seconds? Or some
> > > > fraction of them? That is probably best described as
> > > > "temperature"? I'm not too fond of the term "recency," though I
> > > > can't think of anything better right now.
> > > >
> > > > Anyway, for the read promote behavior, recency is probably
> > > > sufficient, but for the tiering agent flush/evict behavior,
> > > > temperature might be a good thing to consider...
> > > >
> > > > sage
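To illustrate the recency-vs-temperature distinction in code (purely speculative naming, not an implemented Ceph interface):

    #include <deque>
    #include <set>
    #include <string>

    using HitSet = std::set<std::string>;  // placeholder for the real HitSet

    // Recency: index of the newest hit set containing oid (0 = current
    // interval, 1 = previous interval, ...), or -1 if not seen at all.
    int read_recency(const std::deque<HitSet>& hit_sets, const std::string& oid) {
      for (size_t i = 0; i < hit_sets.size(); ++i)
        if (hit_sets[i].count(oid))
          return static_cast<int>(i);
      return -1;
    }

    // Temperature: fraction of the retained intervals in which oid was
    // accessed at least once (1.0 = touched in every interval).
    double temperature(const std::deque<HitSet>& hit_sets, const std::string& oid) {
      if (hit_sets.empty())
        return 0.0;
      size_t hits = 0;
      for (const HitSet& hs : hit_sets)
        if (hs.count(oid))
          ++hits;
      return static_cast<double>(hits) / hit_sets.size();
    }

The read-promote path would consult something like read_recency(), while the tiering agent's flush/evict decisions might weigh temperature().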
