On Thu, 4 Jun 2015, Samuel Just wrote: > Yeah, I think the right answer is to remove that check and restructure > all of the existing reader ops which rely on it to return ENOENT from > do_osd_ops. For copy_get, we instead return 0 and indicate in the > structured payload that the object logically does not exist (along with > the relevant log entries).
We'll need a feature bit so that an old cache tier OSD talking to a new base tier OSD will still get an ENOENT that it can understand. sage > -Sam > > ----- Original Message ----- > From: "Samuel Just" <[email protected]> > To: "Zhiqiang Wang" <[email protected]> > Cc: "David Zafman" <[email protected]>, "Sage Weil" <[email protected]>, > [email protected] > Sent: Thursday, June 4, 2015 11:12:14 AM > Subject: Re: 'Racing read got wrong version' during proxy write testing > > copy_get can happen outside of the context of a cache op, so we can't > (shouldn't) just flag it as a cache op. I wonder whether it wouldn't be > better to live without that check at all and let all of the ops in do_osd_ops > individually return ENOENT as appropriate. > -Sam > > ----- Original Message ----- > From: "Zhiqiang Wang" <[email protected]> > To: "David Zafman" <[email protected]>, "Sage Weil" <[email protected]> > Cc: [email protected] > Sent: Wednesday, June 3, 2015 7:08:35 PM > Subject: RE: 'Racing read got wrong version' during proxy write testing > > Hi David, > > Proxy write hasn't been merge into master yet. It's not likely this is > causing #11511. > > -----Original Message----- > From: David Zafman [mailto:[email protected]] > Sent: Thursday, June 4, 2015 9:46 AM > To: Wang, Zhiqiang; Sage Weil > Cc: [email protected] > Subject: Re: 'Racing read got wrong version' during proxy write testing > > > I'm wonder if this issue could be the cause of #11511. Could a proxy write > have raced with the fill_in_copy_get() so object_info_t size doesn't > correspond with the size of the object in the filestore? > > David > > > On 6/3/15 6:22 PM, Wang, Zhiqiang wrote: > > Making the 'copy get' op to be a cache op seems like a good idea. > > > > -----Original Message----- > > From: Sage Weil [mailto:[email protected]] > > Sent: Thursday, June 4, 2015 9:14 AM > > To: Wang, Zhiqiang > > Cc: [email protected] > > Subject: RE: 'Racing read got wrong version' during proxy write > > testing > > > > On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: > >> I ran into the 'op not idempotent' problem during the testing today. > >> There is one bug in the previous fix. In that fix, we copy the reqids > >> in the final step of 'fill_in_copy_get'. If the object is deleted, > >> since the 'copy get' op is a read op, it returns earlier with ENOENT in > >> do_op. > >> No reqids will be copied during promotion in this case. This again > >> leads to the 'op not idempotent' problem. We need a 'smart' way to > >> detect the op is a 'copy get' op (looping the ops vector doesn't seem > >> smart?) and copy the reqids in this case. > > Hmm. I think the idea here is/was that that ENOENT would somehow include > > the reqid list from PGLog::get_object_reqids(). > > > > I think teh trick is getting it past the generic check in do_op: > > > > if (!op->may_write() && > > !op->may_cache() && > > (!obc->obs.exists || > > ((m->get_snapid() != CEPH_SNAPDIR) && > > obc->obs.oi.is_whiteout()))) { > > reply_ctx(ctx, -ENOENT); > > return; > > } > > > > Maybe we mark these as cache operations so that may_cache is true? > > > > Sam, what do you think? > > > > sage > > > > > >> -----Original Message----- > >> From: Sage Weil [mailto:[email protected]] > >> Sent: Tuesday, May 26, 2015 12:27 AM > >> To: Wang, Zhiqiang > >> Cc: [email protected] > >> Subject: Re: 'Racing read got wrong version' during proxy write > >> testing > >> > >> On Mon, 25 May 2015, Wang, Zhiqiang wrote: > >>> Hi all, > >>> > >>> I ran into a problem during the teuthology test of proxy write. It is > >>> like this: > >>> > >>> - Client sends 3 writes and a read on the same object to base tier > >>> - Set up cache tiering > >>> - Client retries ops and sends the 3 writes and 1 read to the cache > >>> tier > >>> - The 3 writes finished on the base tier, say with versions v1, v2 > >>> and > >>> v3 > >>> - Cache tier proxies the 1st write, and start to promote the object > >>> for the 2nd write, the 2nd and 3rd writes and the read are blocked > >>> - The proxied 1st write finishes on the base tier with version v4, > >>> and returns to cache tier. But somehow the cache tier fails to send > >>> the reply due to socket failure injecting > >>> - Client retries the writes and the read again, the writes are > >>> identified as dup ops > >>> - The promotion finishes, it copies the pg_log entries from the base > >>> tier and put it in the cache tier's pg_log. This includes the 3 > >>> writes on the base tier and the proxied write > >>> - The writes dispatches after the promotion, they are identified as > >>> completed dup ops. Cache tier replies these write ops with the > >>> version from the base tier (v1, v2 and v3) > >>> - In the last, the read dispatches, it reads the version of the > >>> proxied write (v4) and replies to client > >>> - Client complains that 'racing read got wrong version' > >>> > >>> In a previous discussion of the 'ops not idempotent' problem, we solved > >>> it by copying the pg_log entries in the base tier to cache tier during > >>> promotion. Seems like there is still a problem with this approach in the > >>> above scenario. My first thought is that when proxying the write, the > >>> cache tier should use the original reqid from the client. But currently > >>> we don't have a way to pass the original reqid from cache to base. Any > >>> ideas? > >> I agree--I think the correct fix here is to make the proxied op be > >> recognized as a dup. We can either do that by passing in an optional > >> reqid to the Objecter, or extending the op somehow so that both reqids are > >> listed. I think the first option will be cleaner, but I think we will > >> also need to make sure the 'retry' count is preserved as (I think) we skip > >> the dup check if retry==0. And we probably want to preserve the behavior > >> that a given (reqid, retry) only exists once in the system. > >> > >> This probably means adding more optional args to Objecter::read()...? > >> > >> sage > >> -- > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" > >> in the body of a message to [email protected] More majordomo > >> info at http://vger.kernel.org/majordomo-info.html > >> > >> > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > in the body of a message to [email protected] More majordomo > > info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to [email protected] > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to [email protected] > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to [email protected] > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
