I was able to reproduce this on master:
On Thu, 11 Feb 2016, Jason Dillaman wrote:
> I think I see the problem. It looks like you are performing ops directly
> against the cache tier instead of the base tier (assuming cache1 is your
> cache pool). Here are my steps against master where the object is
> successfully promoted upon 'rbd info':
>
> # ceph osd erasure-code-profile set teuthologyprofile
> ruleset-failure-domain=osd m=1 k=2
>
> # ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
> pool 'rbd' removed
>
> # ceph osd pool create rbd 4 4 erasure teuthologyprofile
> pool 'rbd' created
>
> # ceph osd pool create cache 4
> pool 'cache' created
>
> # ceph osd tier add rbd cache
> pool 'cache' is now (or already was) a tier of 'rbd'
>
> # ceph osd tier cache-mode cache writeback
> set cache-mode for pool 'cache' to writeback
>
> # ceph osd tier set-overlay rbd cache
> overlay for 'rbd' is now (or already was) 'cache'
>
> # ceph osd pool set cache hit_set_type bloom
> set pool 2 hit_set_type to bloom
>
> # ceph osd pool set cache hit_set_count 8
> set pool 2 hit_set_count to 8
>
> # ceph osd pool set cache hit_set_period 60
> set pool 2 hit_set_period to 60
>
> # ceph osd pool set cache target_max_objects 250
> set pool 2 target_max_objects to 250
> # ceph osd pool set cache min_read_recency_for_promote 4
> set pool 2 min_read_recency_for_promote to 4
> # rbd -p rbd create test --size=1M
>
> # for x in {0..10}; do rbd -p rbd info test > /dev/null 2>/dev/null ; done
>
> # rados -p cache ls
> rbd_id.test
> test.rbd
> rbd_directory
> rbd_header.101944ba7335
>
> # rados -p cache cache-flush rbd_id.test
>
> # rados -p cache cache-evict rbd_id.test
>
> # rados -p cache ls
> test.rbd
> rbd_directory
> rbd_header.101944ba7335
>
> # rbd -p rbd info test
> rbd image 'test':
> size 1024 kB in 1 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.101944ba7335
> format: 2
> features: layering
> flags:
And then I get EOPNOTSUPP too.
The problem is the get_id op does a sync_read, which fails.
I think Nick's suggestion is the right one: if we get EOPNOTSUPP we force a
promotion. Not sure how tricky that will be to get right, though. A
workaround for rbd might be to put the info in an xattr instead of in
the data payload... that's probably more efficient anyway.
sage
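
(For illustration only: a rough sketch of what the xattr workaround mentioned
above could look like inside cls_rbd, assuming the Ceph objclass API; the
"rbd.id" xattr name and the fallback logic are hypothetical, not the actual
implementation.)

#include "objclass/objclass.h"
#include "include/types.h"
#include "include/encoding.h"
#include <errno.h>
#include <string>

// Sketch of an xattr-backed get_id: an xattr read does not touch the data
// payload, so it avoids the sync_read that currently fails when the op is
// proxied to the EC base tier.
static int get_id(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  bufferlist bl;
  int r = cls_cxx_getxattr(hctx, "rbd.id", &bl);  // hypothetical xattr name
  if (r == -ENODATA) {
    // Fall back to the legacy location (the data payload) for images
    // created before the xattr was written.
    uint64_t size;
    r = cls_cxx_stat(hctx, &size, NULL);
    if (r < 0)
      return r;
    r = cls_cxx_read(hctx, 0, size, &bl);
  }
  if (r < 0)
    return r;

  std::string id;
  bufferlist::iterator it = bl.begin();
  try {
    ::decode(id, it);          // both locations hold an encoded string
  } catch (const buffer::error &) {
    return -EIO;
  }
  ::encode(id, *out);
  return 0;
}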
>
> # rados -p cache ls
> rbd_id.test
> test.rbd
> rbd_directory
> rbd_header.101944ba7335
>
> --
>
> Jason Dillaman
> Red Hat Ceph Storage Engineering
> [email protected]
> http://www.redhat.com
>
>
> ----- Original Message -----
> > From: "Nick Fisk" <[email protected]>
> > To: "Sage Weil" <[email protected]>, "Samuel Just" <[email protected]>
> > Cc: "Jason Dillaman" <[email protected]>, [email protected],
> > [email protected]
> > Sent: Thursday, February 11, 2016 12:46:38 PM
> > Subject: RE: cls_rbd ops on rbd_id.$name objects in EC pool
> >
> > Hi Sage,
> >
> > Do you think this will get fixed in time for the Jewel release? It still
> > seems to happen in Master and is definitely related to the recency setting.
> > I'm guessing that the info command does some sort of read and then a write.
> > In the old behaviour the read would have always triggered a promotion?
> >
> >
> > nick@Ceph-Test:~$ ceph osd pool get cache1 min_read_recency_for_promote
> > min_read_recency_for_promote: 8
> > nick@Ceph-Test:~$ ceph osd pool get cache1 min_write_recency_for_promote
> > min_write_recency_for_promote: 8
> > nick@Ceph-Test:~$ rbd -p cache1 create Test99 --size=10G
> > nick@Ceph-Test:~$ rbd -p cache1 info Test99
> > rbd image 'Test99':
> > size 10240 MB in 2560 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.e8e734689a5e
> > format: 2
> > features: layering
> > flags:
> > nick@Ceph-Test:~$ rados -p cache1 cache-flush rbd_id.Test99
> > nick@Ceph-Test:~$ rados -p cache1 cache-evict rbd_id.Test99
> > nick@Ceph-Test:~$ rbd -p cache1 info Test99
> > 2016-02-11 17:39:40.942030 7f0006eb3700 -1 librbd::image::OpenRequest:
> > failed to retrieve image id: (95) Operation not supported
> > 2016-02-11 17:39:40.942205 7f00066b2700 -1 librbd::ImageState: failed to
> > open image: (95) Operation not supported
> > rbd: error opening image Test99: (95) Operation not supported
> > nick@Ceph-Test:~$ ceph osd pool set cache1 min_read_recency_for_promote 0
> > set pool 12 min_read_recency_for_promote to 0
> > nick@Ceph-Test:~$ rbd -p cache1 info Test99
> > rbd image 'Test99':
> > size 10240 MB in 2560 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.e8e734689a5e
> > format: 2
> > features: layering
> > flags:
> >
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:[email protected]]
> > > Sent: 05 February 2016 19:58
> > > To: 'Sage Weil' <[email protected]>; 'Samuel Just' <[email protected]>
> > > Cc: 'Jason Dillaman' <[email protected]>; [email protected];
> > > [email protected]
> > > Subject: RE: cls_rbd ops on rbd_id.$name objects in EC pool
> > >
> > > > -----Original Message-----
> > > > From: [email protected] [mailto:ceph-devel-
> > > > [email protected]] On Behalf Of Sage Weil
> > > > Sent: 05 February 2016 18:45
> > > > To: Samuel Just <[email protected]>
> > > > Cc: Jason Dillaman <[email protected]>; Nick Fisk <[email protected]>;
> > > > [email protected]; [email protected]
> > > > Subject: Re: cls_rbd ops on rbd_id.$name objects in EC pool
> > > >
> > > > On Fri, 5 Feb 2016, Samuel Just wrote:
> > > > > On Fri, Feb 5, 2016 at 7:53 AM, Jason Dillaman <[email protected]>
> > > > > wrote:
> > > > > > #1 and #2 are awkward for existing pools since we would need a
> > > > > > tool to inject dummy omap values within existing images. Can the
> > > > > > cache tier force-promote it from the EC pool to the cache when an
> > > > > > unsupported op is encountered? There is logic like that in
> > > > > > jewel/master for handling the proxied writes.
> > > >
> > > > That sounded familiar but I couldn't find this in the code or history
> > > > between infernalis and master. And then I went back and was unable to
> > > > reproduce the problem on either the infernalis branch or v9.2.0.
> > > >
> > > > Nick, I was doing
> > > > 1013 ./rbd -p ec create foo --size 10
> > > > 1014 ./rbd -p ec info foo
> > > > 1015 ./rados -p ec-cache cache-flush rbd_id.foo
> > > > 1016 ./rados -p ec-cache cache-evict rbd_id.foo
> > > > 1017 ./rbd -p ec info foo
> > > > 1018 ./rados -p ec-cache ls -
> > > >
> > > > The rbd.get_id is successfully forcing a promotion.
> > > >
> > > > Which makes me think something else is going on... Nick, can you try
> > > > to reproduce this with a userspace librbd client? 'rbd info' will do
> > > > a few basic operations, but if that isn't problematic, try 'rbd
> > > > bench-write' or 'rbd export', which will do real IO?
> > >
> > > Hi Sage,
> > >
> > > Just tried again and I can confirm it's definitely not working, but I think
> > > I may
> > > have stumbled on the reason why.
> > >
> > > First, apologies for not mentioning it before, but I am still running that
> > > recency fix on Infernalis. Initially I thought this was a flushing issue, as
> > > I just assumed those objects shouldn't get flushed out at all. But after
> > > reading your email where you said the op forced the promotion, it struck me
> > > that the broken recency behaviour may have been masking this issue. With the
> > > fix, the object would only be promoted if it was hot enough, which in most
> > > cases it probably wouldn't be. As a test I set my recency values down to 0
> > > and tried the steps above again, and this time it worked. Does this make
> > > sense?
> > >
> > > Nick
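
For reference, a minimal userspace librbd reproducer along the lines Sage
suggests above might look like the sketch below (the pool name "ec", the image
name "foo" and the bare-bones error handling are placeholders):

// Build with: g++ -std=c++11 rbd_repro.cc -lrbd -lrados
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <iostream>

int main()
{
  librados::Rados cluster;
  if (cluster.init("admin") < 0)
    return 1;
  cluster.conf_read_file(NULL);    // default ceph.conf search path
  cluster.conf_parse_env(NULL);
  if (cluster.connect() < 0)
    return 1;

  librados::IoCtx io_ctx;
  if (cluster.ioctx_create("ec", io_ctx) < 0)   // base pool, not the cache pool
    return 1;

  librbd::RBD rbd;
  librbd::Image image;
  // Opening the image issues the rbd.get_id class op against rbd_id.foo;
  // this is where EOPNOTSUPP (-95) shows up once that object has been
  // flushed and evicted down to the EC tier without being re-promoted.
  int r = rbd.open(io_ctx, image, "foo");
  if (r < 0) {
    std::cerr << "open failed: " << r << std::endl;
    return 1;
  }

  librbd::image_info_t info;
  r = image.stat(info, sizeof(info));           // roughly what 'rbd info' does
  if (r == 0)
    std::cout << "size " << info.size << " bytes in "
              << info.num_objs << " objects" << std::endl;

  // Destructors close the image and ioctx and shut the cluster handle down.
  return 0;
}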
> > >
> > > >
> > > > sage
> > > >
> > > >
> > > > > -Sam
> > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Jason Dillaman
> > > > > >
> > > > > > ----- Original Message -----
> > > > > >> From: "Sage Weil" <[email protected]>
> > > > > >> To: "Nick Fisk" <[email protected]>
> > > > > >> Cc: "Jason Dillaman" <[email protected]>,
> > > > > >> [email protected], [email protected]
> > > > > >> Sent: Friday, February 5, 2016 10:42:17 AM
> > > > > >> Subject: cls_rbd ops on rbd_id.$name objects in EC pool
> > > > > >>
> > > > > >> On Wed, 27 Jan 2016, Nick Fisk wrote:
> > > > > >> >
> > > > > >> > > -----Original Message-----
> > > > > >> > > From: ceph-users [mailto:[email protected]]
> > > > > >> > > On Behalf Of Jason Dillaman
> > > > > >> > > Sent: 27 January 2016 14:25
> > > > > >> > > To: Nick Fisk <[email protected]>
> > > > > >> > > Cc: [email protected]
> > > > > >> > > Subject: Re: [ceph-users] Possible Cache Tier Bug - Can
> > > > > >> > > someone confirm
> > > > > >> > >
> > > > > >> > > Are you running with an EC pool behind the cache tier? I know
> > > > > >> > > there was an issue with the first Infernalis release where
> > > > > >> > > unsupported ops were being proxied down to the EC pool,
> > > > > >> > > resulting in that same error.
> > > > > >> >
> > > > > >> > Hi Jason, yes I am. 3x Replicated pool on top of an EC pool.
> > > > > >> >
> > > > > >> > It's probably something similar to what you mention. Either the
> > > > > >> > client should be able to access the RBD header object on the
> > > > > >> > base pool, or it should be flagged so that it can't be evicted.
> > > > > >>
> > > > > >> I just confirmed that the rbd_id.$name object doesn't have any
> > > > > >> omap, so from rados's perspective, flushing and evicting it is
> > > > > >> fine. But yeah, the cls_rbd ops aren't permitted in the EC pool.
> > > > > >>
> > > > > >> In master/jewel we have a cache-pin function that prevents an
> > > > > >> object from being flushed.
> > > > > >>
> > > > > >> A few options are:
> > > > > >>
> > > > > >> 1) Have cls_rbd cache-pin its objects.
> > > > > >>
> > > > > >> 2) Have cls_rbd put an omap key on the object to indirectly do
> > > > > >> the same.
> > > > > >>
> > > > > >> 3) Add a requires-cls type object flag that keeps the object out
> > > > > >> of an EC pool *until* it eventually supports cls ops.
> > > > > >>
> > > > > >> I'd lean toward 1 since it's simple and explicit, and when we
> > > > > >> eventually make classes work we can remove the cache-pin behavior
> > > > > >> from cls_rbd. It's harder to fix in infernalis unless we also
> > > > > >> backport the cache-pin/unpin ops, so maybe #2 would be a simple
> > > > > >> infernalis workaround?
> > > > > >>
> > > > > >> Jason? Sam?
> > > > > >> sage
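
A purely hypothetical illustration of option 2 above (not the actual cls_rbd
change): the class method that writes rbd_id.$name could also drop a dummy
omap key on the object, and since an EC base tier cannot hold omap the object
would then effectively stay pinned in the replicated cache tier. The key name
and surrounding structure below are made up:

#include "objclass/objclass.h"
#include "include/types.h"
#include "include/encoding.h"
#include <errno.h>
#include <string>

static int set_id(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  std::string id;
  bufferlist::iterator it = in->begin();
  try {
    ::decode(id, it);
  } catch (const buffer::error &) {
    return -EINVAL;
  }

  // Existing behaviour: the image id lives in the data payload.
  bufferlist data;
  ::encode(id, data);
  int r = cls_cxx_write_full(hctx, &data);
  if (r < 0)
    return r;

  // Option 2: a dummy omap key whose only purpose is to keep the object in
  // the replicated cache tier ("rbd.pin" is an invented name).
  bufferlist empty;
  return cls_cxx_map_set_val(hctx, "rbd.pin", &empty);
}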
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > >
> > > > > >> > > Jason Dillaman
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > ----- Original Message -----
> > > > > >> > > > From: "Nick Fisk" <[email protected]>
> > > > > >> > > > To: [email protected]
> > > > > >> > > > Sent: Wednesday, January 27, 2016 8:46:53 AM
> > > > > >> > > > Subject: [ceph-users] Possible Cache Tier Bug - Can someone
> > > > > >> > > > confirm
> > > > > >> > > >
> > > > > >> > > > Hi All,
> > > > > >> > > >
> > > > > >> > > > I think I have stumbled on a bug. I'm running Infernalis
> > > > > >> > > > (Kernel 4.4 on the
> > > > > >> > > > client) and it seems that if the RBD header object gets
> > > > > >> > > > evicted from the cache pool then you can no longer map it.
> > > > > >> > > >
> > > > > >> > > > Steps to reproduce
> > > > > >> > > >
> > > > > >> > > > rbd -p cache1 create Test --size=10G
> > > > > >> > > > rbd -p cache1 map Test
> > > > > >> > > >
> > > > > >> > > > /dev/rbd1 <- Works!!
> > > > > >> > > >
> > > > > >> > > > rbd unmap /dev/rbd1
> > > > > >> > > >
> > > > > >> > > > rados -p cache1 cache-flush rbd_id.Test
> > > > > >> > > > rados -p cache1 cache-evict rbd_id.Test
> > > > > >> > > > rbd -p cache1 map Test
> > > > > >> > > >
> > > > > >> > > > rbd: sysfs write failed
> > > > > >> > > > rbd: map failed: (95) Operation not supported
> > > > > >> > > >
> > > > > >> > > > or with the rbd-nbd client
> > > > > >> > > >
> > > > > >> > > > 2016-01-27 13:39:52.686770 7f9e54162b00 -1 asok(0x561837b88360)
> > > > > >> > > > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen:
> > > > > >> > > > failed to bind the UNIX domain socket to
> > > > > >> > > > '/var/run/ceph/ceph-client.admin.asok': (17) File exists
> > > > > >> > > > 2016-01-27 13:39:52.703987 7f9e32ffd700 -1 librbd::image::OpenRequest:
> > > > > >> > > > failed to retrieve image id: (95) Operation not supported
> > > > > >> > > > rbd-nbd: failed to map, status: (95) Operation not supported
> > > > > >> > > > 2016-01-27 13:39:52.704138 7f9e327fc700 -1 librbd::ImageState:
> > > > > >> > > > failed to open image: (95) Operation not supported
> > > > > >> > > >
> > > > > >> > > > Nick
> > > > > >> > > >
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > >
> > > > >
> >
> >